Tech Lead, Machine Learning Engineer - Global E-Commerce (Conversational AI)

ByteDance

4-6 Years

Singapore

Posted a month ago

Job Description

Responsibilities

About the team

We are building the next generation of conversational AI for Global E-commerce: a unified Agent system that learns from every interaction, runs in 30+ languages, and is deployed across one of the largest e-commerce surfaces on the internet. Our 2026 north star is a self-evolving Agent: post-training, harness, tools, memory, and evaluation form one closed loop, and every served conversation becomes training, evaluation, and retrieval signal for the next iteration.

The business surface (buyer, seller, dispute & appeals, operations) is the substrate. Our work is foundational LLM + Agent engineering: post-training, agent harness, tool design, memory, evaluation, inference, multilinguality. We are hiring people who want to push the SOTA of these systems in production, at scale, with hundreds of millions of users in the loop.

What we work on

- LLM post-training & alignment: large-scale SFT, DPO/IPO/KTO, online RL (RLHF / RLAIF / RLVR), reward modeling, preference data curation, long-context training, distillation, QAT. We train and adapt frontier-class open-weights models (7B to 70B) and our own continually-pretrained checkpoints on internal infra (FSDP / DeepSpeed / Megatron-style stacks).
- Agent foundations: harness design (context engineering, sub-agents, durable execution, parallel tool use), tool design (ACI principles, namespaced surfaces, poka-yoke, instrumented traces), memory (episodic + semantic + skill-shaped), MCP and Skill-style extensibility. We treat tools and prompts as APIs and iterate against production traces.
- Auto-eval and observability: LLM-as-judge with calibrated human agreement, real-traffic replay, failure-mode taxonomies, regression + safety + cost + latency harnesses. We have cut root-cause analysis on a single case from 13 engineer-days to 3 minutes, automated.
- Self-evolving systems: every served conversation becomes a candidate for training data, eval set membership, retrieval index, and skill induction, with privacy and quality gates. The flywheel is the product.
- Inference & serving: vLLM / TensorRT-LLM, MoE, speculative decoding, KV-cache reuse and prompt caching, multi-tenant low-latency serving. Cost per resolved conversation is a first-class metric.
- Multilinguality & locale grounding: 30+ languages, low-resource adaptation, faithful translation, locale-aware reasoning, cross-cultural tone.
- Reasoning & long-context modeling: chain-of-thought / planning post-training, reasoning-trace supervision, long-context training and serving, retrieval-augmented reasoning, self-consistency and verifier models.

Responsibilities

- Set technical direction. Own a multi-quarter roadmap across one or more of: post-training, agent harness, evaluation, self-evolving data flywheel, serving. Translate north-star metrics into a sequence of 2-3 high-ROI bets per quarter and ship them.
- Compound the team. Hire and develop 1-3 strong ICs. Design their work surfaces for growth, not just dispatch. Raise the median technical bar through design review, code review, and 1:1 framing.
- Stay in the loop with the model. Tech Lead is not a manager role. You still write the load-bearing PRs, propose the core abstractions, and write the design docs that decide the team's ceiling for the next 2-3 quarters.
- Drive cross-team alignment. Partner with foundation-model, infra, product, and adjacent algorithm teams; own sign-off on cross-cutting technical decisions.
- Observability and rollback. Build the per-turn tracing, tool-call analytics, and failure-mode taxonomies that let the team diagnose any regression within hours, not days.

Qualifications

Minimum Qualifications

- BS / MS / PhD in CS, AI, Mathematics, or a related quantitative field.
- Hands-on experience in ML / NLP / applied DL. Top PhDs with a strong publication record may qualify at 4+ years.
- Strong Python and at least one of C++ / Go / Rust for production-path code.
- Hands-on post-training or fine-tuning of frontier-class LLMs (7B, multi-node). Not API-only.
- Has led at least one production LLM / Agent system from zero to one.

Preferred Qualifications

- LLM post-training: multi-node SFT / DPO / online RL on 7B models, reward modeling, preference data construction, RLAIF / RLVR, distillation, QAT, long-context training, continual pre-training.
- Agent engineering (Anthropic-style): production agent harness, context engineering, sub-agents, durable execution, MCP / Skill-style extensibility, parallel tool use, computer use.
- Reasoning & planning: chain-of-thought / reasoning-trace training, planner/critic decomposition, self-consistency, verifier models, multi-step reasoning evaluation.
- LLM / Agent evaluation: LLM-as-judge with human-agreement calibration, tau-bench / SWE-bench / GAIA / BFCL-style harnesses, regression + safety + cost + latency-aware evaluation.
- Inference & serving systems: vLLM / TensorRT-LLM, MoE, speculative decoding, KV-cache and prompt caching, low-latency multi-tenant serving.
- Multilingual & cross-cultural reasoning: multilingual SFT/DPO, low-resource adaptation, faithful MT, locale-aware reasoning.

More Info

Job Type:

Permanent Job

Industry:

IT / Computers - Software

Function:

AI/ML

Employment Type:

Full time

About Company

ByteDance

ByteDance is a technology company operating a range of content platforms that inform, educate, entertain and inspire people across languages, cultures, and geographies.
Dedicated to building global platforms of creation and interaction, ByteDance now has a portfolio of applications available in over 150 markets and 75 languages, including TikTok, Helo, Vigo Video, Douyin, and Huoshan.

Job ID: 148291745

Last Updated: 29-06-2026 05:50:41 AM
