Job Description
We are looking for a highly skilled engineer to build, optimize, and maintain high-performance inference services for large language models (LLMs) and multimodal models. You will work closely with algorithm, systems, and product teams to deliver best-in-class performance, stability, and efficiency in production environments, ensuring low-latency, highly available AI services for tens of millions of users.
Key Responsibilities
High-Performance Computing & Kernel Optimization
- Perform deep GPU/CUDA kernel optimization, including memory access pattern tuning, instruction-level parallelism, and warp-level optimization to fully utilize hardware capabilities
- Develop and optimize custom high-performance operators using advanced DSLs or compiler frameworks such as Triton and TVM
- Identify and resolve performance bottlenecks in scenarios such as operator fusion and quantization
- Utilize performance profiling tools (e.g., NVIDIA Nsight Compute and Nsight Systems) for full-stack analysis and bottleneck identification, from application level down to hardware
AI Infrastructure Platform Development
- Contribute to the development and maintenance of a company-level AI platform supporting large-scale model training and inference
- Build elastic GPU resource management and scheduling systems based on Kubernetes
- Leverage open-source tools such as Volcano, Kubeflow, and Fluid to optimize AI workload scheduling and data acceleration
- Design and implement end-to-end MLOps pipelines, including task orchestration, experiment tracking, model versioning, and service deployment
Distributed Training Systems
- Apply and optimize distributed training frameworks such as Megatron-LM, DeepSpeed, and Colossal-AI
- Solve system challenges across data parallelism, model parallelism, pipeline parallelism, and hybrid parallelism
- Optimize distributed communication using NCCL, MPI, RDMA/InfiniBand in large-scale clusters (thousands of GPUs)
- Diagnose and resolve issues related to network congestion, topology awareness, and cluster stability
- Ensure reliability, fault tolerance, and performance tuning for large-scale training clusters
LLM Inference Optimization
- Build and optimize high-throughput, low-latency inference services
- Deeply customize and optimize inference engines such as vLLM, TensorRT-LLM, SGLang, and TGI
- Implement and optimize key acceleration techniques including PagedAttention, continuous batching, quantization (INT8/FP8/AWQ), dynamic batching, and prefix caching
- Design highly available inference architectures supporting multi-model, multi-version deployment with dynamic scaling
Qualifications
Basic Requirements
- Bachelor's degree or above in Computer Science, Electrical Engineering, or a related field
- 3+ years of system software development experience, including at least 1 year in AI systems or high-performance computing
- Proficiency in C++ and Python, with strong foundations in algorithms, data structures, and system programming
- Deep understanding of modern GPU architectures (e.g., NVIDIA Hopper/Ampere) or AI accelerators, with hands-on CUDA optimization experience
Core Skills & Experience (at least two of the following)
Training
- Experience with PyTorch distributed development
- Hands-on experience with Megatron-LM or DeepSpeed
- Proven ability to solve large-scale distributed training challenges (communication, memory, scheduling)
Inference
- Experience customizing or optimizing inference frameworks such as vLLM or TensorRT-LLM
- Experience optimizing high-concurrency online inference services (1,000+ QPS)
Platform
- Experience building AI platforms based on Kubernetes
- Familiarity with Kubeflow, Volcano, and related ecosystems
- Experience with service mesh technologies such as Istio or Envoy for managing inference traffic
Preferred Qualifications
- Publications in top-tier conferences such as ACL, MLSys, ASPLOS, OSDI, or SOSP
- Experience deploying and optimizing AI workloads on large-scale clusters (1,000+ GPUs)
- Familiarity with heterogeneous computing and experience with DPU or custom AI chips
- Deep understanding of AI compilers (e.g., MLIR, XLA) with hands-on development experience
Soft Skills
- Strong problem-solving skills with the ability to break down complex systems and identify root causes
- High level of ownership, self-motivation, and ability to drive ambiguous and challenging problems forward
- Excellent communication and collaboration skills across research, product, and infrastructure teams