AI Inference Engineer

3-5 Years
SGD 6,500 - 10,000 per month

Job Description

We are looking for a highly skilled engineer to build, optimize, and maintain high-performance inference services for large language models (LLMs) and multimodal models. You will work closely with algorithm, systems, and product teams to deliver best-in-class performance, stability, and efficiency in production environments, ensuring low-latency, highly available AI services for tens of millions of users.

Key Responsibilities

High-Performance Computing & Kernel Optimization

  • Perform deep GPU/CUDA kernel optimization, including memory access pattern tuning, instruction-level parallelism, and warp-level optimization to fully utilize hardware capabilities
  • Develop and optimize custom high-performance operators using DSLs or compiler frameworks such as Triton and TVM (see the sketch following this list)
  • Identify and resolve performance bottlenecks in scenarios such as operator fusion and quantization
  • Use performance profiling tools (e.g., NVIDIA Nsight Compute and Nsight Systems) for full-stack analysis and bottleneck identification, from application level down to hardware
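
To give a concrete flavor of the operator work described above, here is a minimal sketch of a fused elementwise kernel in the Triton DSL. The operation (add + ReLU) and the block size are illustrative assumptions, not a statement about this role's actual codebase:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements              # guard out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    out = tl.maximum(x + y, 0.0)             # add and ReLU fused in one pass over memory
    tl.store(out_ptr + offsets, out, mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)           # one program instance per 1024-element block
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Fusing the two elementwise ops avoids materializing the intermediate sum in global memory, the kind of memory-access-pattern win the first bullet refers to.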

AI Infrastructure Platform Development

  • Contribute to the development and maintenance of a company-level AI platform supporting large-scale model training and inference
  • Build elastic GPU resource management and scheduling systems based on Kubernetes (see the sketch following this list)
  • Leverage open-source tools such as Volcano, Kubeflow, and Fluid to optimize AI workload scheduling and data acceleration
  • Design and implement end-to-end MLOps pipelines, including task orchestration, experiment tracking, model versioning, and service deployment
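
As a rough illustration of programmatic GPU scheduling on Kubernetes, here is a hedged sketch using the official kubernetes Python client. The job name, container image, entrypoint, and namespace are hypothetical placeholders:

```python
from kubernetes import client, config

def submit_gpu_job(name: str = "llm-batch-job", gpus: int = 1) -> None:
    """Submit a single-pod batch job that requests NVIDIA GPUs."""
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    container = client.V1Container(
        name=name,
        image="registry.example.com/llm-trainer:latest",   # hypothetical image
        command=["python", "train.py"],                    # hypothetical entrypoint
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": str(gpus)},  # GPUs exposed via the NVIDIA device plugin
        ),
    )
    template = client.V1PodTemplateSpec(
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(template=template, backoff_limit=2),
    )
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

A batch-scheduler such as Volcano would typically sit on top of this, adding gang scheduling and queueing for multi-pod training jobs.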

Distributed Training Systems

  • Apply and optimize distributed training frameworks such as Megatron-LM, DeepSpeed, and Colossal-AI
  • Solve system challenges across data parallelism, model parallelism, pipeline parallelism, and hybrid parallelism (a data-parallel sketch follows this list)
  • Optimize distributed communication using NCCL, MPI, RDMA/InfiniBand in large-scale clusters (thousands of GPUs)
  • Diagnose and resolve issues related to network congestion, topology awareness, and cluster stability
  • Ensure reliability, fault tolerance, and performance tuning for large-scale training clusters
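
The sketch below shows the simplest building block of this work: NCCL-backed data parallelism with PyTorch's DistributedDataParallel. The model and training loop are stand-ins, far simpler than a real Megatron-LM or DeepSpeed pipeline, and the launch setup assumes torchrun:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main() -> None:
    dist.init_process_group(backend="nccl")      # NCCL handles the GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)
    model = torch.nn.Linear(4096, 4096).cuda(local_rank)  # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])  # gradients all-reduced across ranks
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(10):                          # stand-in training loop
        x = torch.randn(8, 4096, device=f"cuda:{local_rank}")
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()                          # all-reduce overlaps with backward pass
        opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # e.g. torchrun --nproc_per_node=8 ddp_sketch.py
```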

LLM Inference Optimization

  • Build and optimize high-throughput, low-latency inference services
  • Deeply customize and optimize inference engines such as vLLM, TensorRT-LLM, SGLang, and TGI (a vLLM usage sketch follows this list)
  • Implement and optimize key acceleration techniques including PagedAttention, continuous batching, quantization (INT8/FP8/AWQ), dynamic batching, and prefix caching
  • Design highly available inference architectures supporting multi-model, multi-version deployment with dynamic scaling
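
As a minimal sketch of the serving stack above, the snippet below runs vLLM offline, which applies PagedAttention and continuous batching out of the box; the model name and sampling settings are illustrative assumptions:

```python
from vllm import LLM, SamplingParams

# Model and memory settings are illustrative assumptions.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.90)
params = SamplingParams(temperature=0.7, max_tokens=128)
for out in llm.generate(["Explain PagedAttention in one sentence."], params):
    print(out.outputs[0].text)
```

The customization work in this role goes well beyond such usage, into the engine internals: scheduler policy, KV-cache management, and custom quantized kernels.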

Qualifications

Basic Requirements

  • Bachelor's degree or above in Computer Science, Electrical Engineering, or a related field
  • 3+ years of system software development experience, including at least 1 year in AI systems or high-performance computing
  • Proficiency in C++ and Python, with strong foundations in algorithms, data structures, and system programming
  • Deep understanding of modern GPU architectures (e.g., NVIDIA Hopper/Ampere) or AI accelerators, with hands-on CUDA optimization experience

Core Skills & Experience (at least two of the following)

Training

  • Experience with PyTorch distributed development
  • Hands-on experience with Megatron-LM or DeepSpeed
  • Proven ability to solve large-scale distributed training challenges (communication, memory, scheduling)

Inference

  • Experience customizing or optimizing inference frameworks such as vLLM or TensorRT-LLM
  • Experience optimizing high-concurrency online inference services (1,000+ QPS)

Platform

  • Experience building AI platforms based on Kubernetes
  • Familiarity with Kubeflow, Volcano, and related ecosystems
  • Experience with service mesh technologies such as Istio or Envoy for managing inference traffic

Preferred Qualifications

  • Publications in top-tier conferences such as ACL, MLSys, ASPLOS, OSDI, or SOSP
  • Experience deploying and optimizing AI workloads on large-scale clusters (1,000+ GPUs)
  • Familiarity with heterogeneous computing and experience with DPU or custom AI chips
  • Deep understanding of AI compilers (e.g., MLIR, XLA) with hands-on development experience

Soft Skills

  • Strong problem-solving skills with the ability to break down complex systems and identify root causes
  • High level of ownership, self-motivation, and ability to drive ambiguous and challenging problems forward
  • Excellent communication and collaboration skills across research, product, and infrastructure teams

More Info

Job ID: 147015353