AI Inference Engineer

3-5 Years
SGD 6,500 - 10,000 per month

Job Description

We are looking for a highly skilled engineer to build, optimize, and maintain high-performance inference services for large language models (LLMs) and multimodal models. You will work closely with algorithm, systems, and product teams to deliver best-in-class performance, stability, and efficiency in production environments, ensuring low-latency, highly available AI services for tens of millions of users.

Key Responsibilities

High-Performance Computing & Kernel Optimization

  • Perform deep GPU/CUDA kernel optimization, including memory access pattern tuning, instruction-level parallelism, and warp-level optimization to fully utilize hardware capabilities
  • Develop and optimize custom high-performance operators using DSLs or compiler frameworks such as Triton and TVM (see the sketch following this list)
  • Identify and resolve performance bottlenecks in scenarios such as operator fusion and quantization
  • Use performance profiling tools (e.g., NVIDIA Nsight Compute and Nsight Systems) for full-stack analysis and bottleneck identification, from application level down to hardware
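
To give a concrete flavor of the operator work described above, here is a minimal sketch of a fused elementwise kernel in the Triton DSL. The operation (add + ReLU) and the block size are illustrative assumptions, not a statement about this role's actual codebase:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements              # guard out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    out = tl.maximum(x + y, 0.0)             # add and ReLU fused in one pass over memory
    tl.store(out_ptr + offsets, out, mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)           # one program instance per 1024-element block
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Fusing the two elementwise ops avoids materializing the intermediate sum in global memory, the kind of memory-access-pattern win the first bullet refers to.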

AI Infrastructure Platform Development

  • Contribute to the development and maintenance of a company-level AI platform supporting large-scale model training and inference
  • Build elastic GPU resource management and scheduling systems based on Kubernetes (see the sketch following this list)
  • Leverage open-source tools such as Volcano, Kubeflow, and Fluid to optimize AI workload scheduling and data acceleration
  • Design and implement end-to-end MLOps pipelines, including task orchestration, experiment tracking, model versioning, and service deployment
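
As a rough illustration of programmatic GPU scheduling on Kubernetes, here is a hedged sketch using the official kubernetes Python client. The job name, container image, entrypoint, and namespace are hypothetical placeholders:

```python
from kubernetes import client, config

def submit_gpu_job(name: str = "llm-batch-job", gpus: int = 1) -> None:
    """Submit a single-pod batch job that requests NVIDIA GPUs."""
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    container = client.V1Container(
        name=name,
        image="registry.example.com/llm-trainer:latest",   # hypothetical image
        command=["python", "train.py"],                    # hypothetical entrypoint
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": str(gpus)},  # GPUs exposed via the NVIDIA device plugin
        ),
    )
    template = client.V1PodTemplateSpec(
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(template=template, backoff_limit=2),
    )
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

A batch-scheduler such as Volcano would typically sit on top of this, adding gang scheduling and queueing for multi-pod training jobs.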

Distributed Training Systems

  • Apply and optimize distributed training frameworks such as Megatron-LM, DeepSpeed, and Colossal-AI
  • Solve system challenges across data parallelism, model parallelism, pipeline parallelism, and hybrid parallelism (a data-parallel sketch follows this list)
  • Optimize distributed communication using NCCL, MPI, RDMA/InfiniBand in large-scale clusters (thousands of GPUs)
  • Diagnose and resolve issues related to network congestion, topology awareness, and cluster stability
  • Ensure reliability, fault tolerance, and performance tuning for large-scale training clusters
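
The sketch below shows the simplest building block of this work: NCCL-backed data parallelism with PyTorch's DistributedDataParallel. The model and training loop are stand-ins, far simpler than a real Megatron-LM or DeepSpeed pipeline, and the launch setup assumes torchrun:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main() -> None:
    dist.init_process_group(backend="nccl")      # NCCL handles the GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)
    model = torch.nn.Linear(4096, 4096).cuda(local_rank)  # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])  # gradients all-reduced across ranks
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(10):                          # stand-in training loop
        x = torch.randn(8, 4096, device=f"cuda:{local_rank}")
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()                          # all-reduce overlaps with backward pass
        opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # e.g. torchrun --nproc_per_node=8 ddp_sketch.py
```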

LLM Inference Optimization

  • Build and optimize high-throughput, low-latency inference services
  • Deeply customize and optimize inference engines such as vLLM, TensorRT-LLM, SGLang, and TGI (a vLLM usage sketch follows this list)
  • Implement and optimize key acceleration techniques including PagedAttention, continuous batching, quantization (INT8/FP8/AWQ), dynamic batching, and prefix caching
  • Design highly available inference architectures supporting multi-model, multi-version deployment with dynamic scaling
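
As a minimal sketch of the serving stack above, the snippet below runs vLLM offline, which applies PagedAttention and continuous batching out of the box; the model name and sampling settings are illustrative assumptions:

```python
from vllm import LLM, SamplingParams

# Model and memory settings are illustrative assumptions.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.90)
params = SamplingParams(temperature=0.7, max_tokens=128)
for out in llm.generate(["Explain PagedAttention in one sentence."], params):
    print(out.outputs[0].text)
```

The customization work in this role goes well beyond such usage, into the engine internals: scheduler policy, KV-cache management, and custom quantized kernels.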

Qualifications

Basic Requirements

  • Bachelor's degree or above in Computer Science, Electrical Engineering, or a related field
  • 3+ years of system software development experience, including at least 1 year in AI systems or high-performance computing
  • Proficiency in C++ and Python, with strong foundations in algorithms, data structures, and system programming
  • Deep understanding of modern GPU architectures (e.g., NVIDIA Hopper/Ampere) or AI accelerators, with hands-on CUDA optimization experience

Core Skills & Experience (at least two of the following)

Training

  • Experience with PyTorch distributed development
  • Hands-on experience with Megatron-LM or DeepSpeed
  • Proven ability to solve large-scale distributed training challenges (communication, memory, scheduling)

Inference

  • Experience customizing or optimizing inference frameworks such as vLLM or TensorRT-LLM
  • Experience optimizing high-concurrency online inference services (1,000+ QPS)

Platform

  • Experience building AI platforms based on Kubernetes
  • Familiarity with Kubeflow, Volcano, and related ecosystems
  • Experience with service mesh technologies such as Istio or Envoy for managing inference traffic

Preferred Qualifications

  • Publications in top-tier conferences such as ACL, MLSys, ASPLOS, OSDI, or SOSP
  • Experience deploying and optimizing AI workloads on large-scale clusters (1,000+ GPUs)
  • Familiarity with heterogeneous computing and experience with DPU or custom AI chips
  • Deep understanding of AI compilers (e.g., MLIR, XLA) with hands-on development experience

Soft Skills

  • Strong problem-solving skills with the ability to break down complex systems and identify root causes
  • High level of ownership, self-motivation, and ability to drive ambiguous and challenging problems forward
  • Excellent communication and collaboration skills across research, product, and infrastructure teams

More Info

Job ID: 147015353