
Location:
Singapore
Onsite Interview:
Required (Singapore or Beijing)
Level:
Early Career / High-Potential Engineers
We are building high-performance large model inference systems that push GPUs to their limits.
We are looking for exceptional engineers to design and optimize production-grade LLM inference infrastructure, achieving:
Extreme performance
Ultra-low latency
Maximum GPU utilization
Lowest cost per token
This is a core role that directly impacts our company's technical competitiveness.
Responsibilities:
Build and optimize high-performance inference services based on:
vLLM
TensorRT-LLM
SGLang
FasterTransformer
TGI (Text Generation Inference)
Deploy production-grade inference systems serving real workloads.
Optimize:
Latency
Throughput
Cost per token
Using techniques such as:
KV cache optimization
Continuous batching
Paged attention
Speculative decoding
Prefix caching
Quantization (FP8 / INT8 / INT4)
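The KV-cache techniques above (paged attention, prefix caching) can be illustrated with a minimal sketch of a block-table allocator in the spirit of vLLM's PagedAttention. The class and method names are illustrative only, not any real library's API:

```python
# Minimal sketch of a paged KV-cache allocator (illustrative, not vLLM's API).
# KV memory is split into fixed-size blocks; each sequence holds a block table,
# so sequences grow on demand instead of reserving contiguous max-length buffers.

class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical block ids
        self.block_tables = {}                       # seq_id -> [block ids]
        self.seq_lens = {}                           # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> int:
        """Reserve cache space for one new token; return its physical block."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:                 # current block full (or none yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt a sequence")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[-1]

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):                                  # 20 tokens span 2 blocks
    cache.append_token(seq_id=0)
print(len(cache.block_tables[0]))                    # prints 2
```

Block-level allocation is also what makes prefix caching natural: sequences sharing a prompt prefix can point their block tables at the same physical blocks.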
Improve GPU efficiency by optimizing:
Memory bandwidth utilization
Tensor Core utilization
Kernel launch efficiency
Working with:
CUDA
Triton kernels
FlashAttention
Custom CUDA kernels
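Memory-bandwidth and Tensor Core utilization are usually reasoned about with a roofline estimate. A minimal sketch, with peak numbers that are illustrative placeholders rather than any specific GPU's spec sheet:

```python
# Roofline-style sketch: decide whether a kernel is memory- or compute-bound.
# Peak figures below are assumed placeholders, not a particular GPU's specs.

PEAK_TFLOPS = 989.0        # assumed dense FP16 Tensor Core peak, TFLOP/s
PEAK_BW_GBS = 3350.0       # assumed HBM bandwidth, GB/s

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    return flops / bytes_moved                   # FLOPs per byte of traffic

def attainable_tflops(intensity: float) -> float:
    # min(compute roof, bandwidth roof * intensity)
    return min(PEAK_TFLOPS, PEAK_BW_GBS * intensity / 1000.0)

# Single-token decode GEMV over a [4096 x 4096] FP16 weight matrix:
flops = 2 * 4096 * 4096                          # one multiply-add per weight
bytes_moved = 2 * 4096 * 4096                    # read each FP16 weight once
ai = arithmetic_intensity(flops, bytes_moved)    # = 1.0 FLOP/byte
print(attainable_tflops(ai))                     # far below compute peak
```

At batch size 1, decode is deep in the bandwidth-bound regime; batching more tokens per weight read raises arithmetic intensity, which is one reason continuous batching improves throughput so dramatically.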
Design and implement:
Tensor parallelism
Pipeline parallelism
Expert parallelism (MoE)
Multi-node inference
Using:
NCCL
CUDA
RDMA
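The idea behind tensor parallelism can be shown with a toy column-parallel matmul; the collectives here are stand-in Python functions, whereas a real system would issue NCCL calls across GPUs:

```python
# Toy tensor-parallel matmul sketch (collectives simulated; real systems use NCCL).
# Column-parallel: each "rank" holds a column shard of W and computes a partial
# output; the shards are then all-gathered into the full result.

def matmul(x, W):                                # x: [n], W: [n][m] -> [m]
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def shard_columns(W, world_size):
    m = len(W[0]) // world_size
    return [[row[r * m:(r + 1) * m] for row in W] for r in range(world_size)]

def all_gather(parts):                           # stand-in for NCCL AllGather
    return [v for part in parts for v in part]

x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
shards = shard_columns(W, world_size=2)
partials = [matmul(x, Wr) for Wr in shards]      # each rank computes its shard
print(all_gather(partials))                      # == matmul(x, W)
```

Row-parallel layers are the dual case: each rank produces a full-width partial sum, combined with an AllReduce instead of an AllGather. Real deployments interleave the two to minimize communication per transformer block.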
Build large-scale inference platforms including:
Inference scheduler
Load balancer
Multi-tenant inference system
Supporting:
Thousands of GPUs
Billions of tokens per day
Reduce cost per token through:
Advanced batching strategies
GPU memory optimization
Cluster scheduling
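Why batching drives cost per token down can be seen with a back-of-envelope model; every number below (price, step time, scaling factor) is a made-up placeholder:

```python
# Back-of-envelope cost-per-token model (all figures are illustrative assumptions).
# Larger batches amortize the fixed per-step weight reads across more tokens, so
# step time grows only mildly while tokens/step grows linearly -- until latency
# SLOs or KV-cache memory bind.

GPU_HOUR_USD = 2.0                               # assumed GPU rental price

def tokens_per_second(batch_size: int, per_step_ms: float = 20.0) -> float:
    # Toy model: decode step time inflates 2% per extra sequence in the
    # bandwidth-bound regime; one step emits `batch_size` tokens.
    step_ms = per_step_ms * (1.0 + 0.02 * batch_size)
    return batch_size * 1000.0 / step_ms

def usd_per_million_tokens(batch_size: int) -> float:
    return GPU_HOUR_USD / 3600.0 / tokens_per_second(batch_size) * 1e6

for bs in (1, 8, 64):
    print(bs, round(usd_per_million_tokens(bs), 2))
```

Even this crude model shows an order-of-magnitude cost gap between batch 1 and batch 64, which is why scheduler and batching work sits at the center of inference economics.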
Requirements:
Strong experience or project exposure in several of the following areas:
CUDA / CUDA Kernel development
GPU performance tuning
Kernel / Operator optimization
Triton / TVM
TensorRT acceleration
Megatron-LM
DeepSpeed
Colossal-AI
vLLM / SGLang
Large model inference optimization
Quantization / KV cache optimization (plus)
Distributed Systems
PyTorch Distributed
NCCL
HPC (High Performance Computing)
AI Infrastructure / ML Infra
Multi-GPU / Multi-node training systems
Bachelor's degree in CS/EE/AI from a strong university; Master's degree preferred
Strong foundation in:
Computer Systems
Operating Systems
Parallel Computing
Distributed Systems
Linear Algebra & ML fundamentals
Competitive programming / ACM / research experience is a plus
Publications or open-source contributions are a plus
Job ID: 143350857