Tech Lead, Machine Learning Engineer - Global E-Commerce (Conversational AI)

4-6 Years
  • Posted 3 hours ago

Job Description

Responsibilities

About the team

We are building the next generation of conversational AI for Global E-commerce: a unified Agent system that learns from every interaction, runs in 30+ languages, and is deployed across one of the largest e-commerce surfaces on the internet. Our 2026 north star is a self-evolving Agent: post-training, harness, tools, memory, and evaluation form one closed loop, and every served conversation becomes training, evaluation, and retrieval signal for the next iteration. The business surface (buyer, seller, dispute & appeals, operations) is the substrate. Our work is foundational LLM + Agent engineering: post-training, agent harness, tool design, memory, evaluation, inference, multilinguality. We are hiring people who want to push the SOTA of these systems in production, at scale, with hundreds of millions of users in the loop.

What we work on

- LLM post-training & alignment: large-scale SFT, DPO/IPO/KTO, online RL (RLHF / RLAIF / RLVR), reward modeling, preference data curation, long-context training, distillation, QAT. We train and adapt frontier-class open-weights models (≥7B → ≥70B) and our own continually-pretrained checkpoints on internal infra (FSDP / DeepSpeed / Megatron-style stacks).
- Agent foundations: harness design (context engineering, sub-agents, durable execution, parallel tool use), tool design (ACI principles, namespaced surfaces, poka-yoke, instrumented traces), memory (episodic + semantic + skill-shaped), MCP and Skill-style extensibility. We treat tools and prompts as APIs and iterate against production traces.
- Auto-eval and observability: LLM-as-judge with calibrated human agreement, real-traffic replay, failure-mode taxonomies, regression + safety + cost + latency harnesses. We have cut root-cause analysis on a single case from 13 engineer-days to 3 minutes, fully automated.
- Self-evolving systems: every served conversation becomes a candidate for training data, eval set membership, retrieval index, and skill induction, with privacy and quality gates. The flywheel is the product.
- Inference & serving: vLLM / TensorRT-LLM, MoE, speculative decoding, KV-cache reuse and prompt caching, multi-tenant low-latency serving. Cost per resolved conversation is a first-class metric.
- Multilinguality & locale grounding: 30+ languages, low-resource adaptation, faithful translation, locale-aware reasoning, cross-cultural tone.
- Reasoning & long-context modeling: chain-of-thought / planning post-training, reasoning-trace supervision, long-context training and serving, retrieval-augmented reasoning, self-consistency and verifier models.

Responsibilities

- Set technical direction. Own a multi-quarter roadmap across one or more of: post-training, agent harness, evaluation, self-evolving data flywheel, serving. Translate north-star metrics into a sequence of 2-3 high-ROI bets per quarter and ship them.
- Compound the team. Hire and develop 1-3 strong ICs. Design their work surfaces for growth, not just dispatch. Raise the median technical bar through design review, code review, and 1:1 framing.
- Stay in the loop with the model. Tech Lead is not a manager role. You still write the load-bearing PRs, propose the core abstractions, and write the design docs that decide the team's ceiling for the next 2-3 quarters.
- Drive cross-team alignment. Partner with foundation-model, infra, product, and adjacent algorithm teams, and own sign-off on cross-cutting technical decisions.
- Observability and rollback. Build the per-turn tracing, tool-call analytics, and failure-mode taxonomies that let the team diagnose any regression within hours, not days.

Qualifications

Minimum Qualifications

- BS / MS / PhD in CS, AI, Mathematics, or a related quantitative field.
- Hands-on experience in ML / NLP / applied DL. Top PhDs with a strong publication record may qualify at 4+ years.
- Strong Python and at least one of C++ / Go / Rust for production-path code.
- Hands-on post-training or fine-tuning of frontier-class LLMs (≥7B, multi-node). Not API-only.
- Has led at least one production LLM / Agent system from zero to one.

Preferred Qualifications

- LLM post-training: multi-node SFT / DPO / online RL on ≥7B models, reward modeling, preference data construction, RLAIF / RLVR, distillation, QAT, long-context training, continual pre-training.
- Agent engineering (Anthropic-style): production agent harness, context engineering, sub-agents, durable execution, MCP / Skill-style extensibility, parallel tool use, computer use.
- Reasoning & planning: chain-of-thought / reasoning-trace training, planner/critic decomposition, self-consistency, verifier models, multi-step reasoning evaluation.
- LLM / Agent evaluation: LLM-as-judge with human-agreement calibration, tau-bench / SWE-bench / GAIA / BFCL-style harnesses, regression + safety + cost + latency-aware evaluation.
- Inference & serving systems: vLLM / TensorRT-LLM, MoE, speculative decoding, KV-cache and prompt caching, low-latency multi-tenant serving.
- Multilingual & cross-cultural reasoning: multilingual SFT/DPO, low-resource adaptation, faithful MT, locale-aware reasoning.

About Company

ByteDance is a technology company operating a range of content platforms that inform, educate, entertain and inspire people across languages, cultures, and geographies.
Dedicated to building global platforms of creation and interaction, ByteDance now has a portfolio of applications, including TikTok, Helo, Vigo Video, Douyin, and Huoshan, available in over 150 markets and 75 languages.

Job ID: 148291745
