
LLM Inference Engineer

Fresher
SGD 6,000 - 14,000 per month
  • Posted 20 hours ago

Job Description

Location: Singapore

Onsite Interview: Required (Singapore or Beijing)

Level: Early Career / High-Potential Engineers

We are building high-performance large model inference systems that push GPUs to their limits.

We are looking for exceptional engineers to design and optimize production-grade LLM inference infrastructure, achieving:

  • Extreme performance

  • Ultra-low latency

  • Maximum GPU utilization

  • Lowest cost per token

This is a core role that directly impacts our company's technical competitiveness.

What You Will Do

Production LLM Inference Systems

Build and optimize high-performance inference services based on:

  • vLLM

  • TensorRT-LLM

  • SGLang

  • FasterTransformer

  • TGI (Text Generation Inference)

Deploy production-grade inference systems serving real workloads.

Inference Performance Optimization

Optimize:

  • Latency

  • Throughput

  • Cost per token

Using techniques such as:

  • KV cache optimization

  • Continuous batching

  • Paged attention

  • Speculative decoding

  • Prefix caching

  • Quantization (FP8 / INT8 / INT4)
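As an illustration of the continuous-batching idea above: rather than waiting for an entire batch to finish, finished sequences leave the batch at every decode step and queued requests are admitted immediately, keeping the GPU busy. A toy sketch in pure Python (all names are hypothetical, not production code):

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy continuous-batching scheduler (illustrative sketch only).

    `requests` is a list of (request_id, tokens_to_generate) pairs.
    Returns the decode step at which each request finished. At every
    step, finished sequences are evicted and waiting requests are
    admitted, so batch slots never sit idle between requests.
    """
    waiting = deque(requests)
    running = {}        # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit new requests into any free batch slots.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        step += 1
        # One decode step: every running sequence emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                finished_at[rid] = step
                del running[rid]   # slot freed for the next request

    return finished_at

done = continuous_batching(
    [("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)], max_batch=2
)
# "c" starts as soon as "a" finishes, without waiting for "b".
```

Production engines such as vLLM implement this idea at the level of paged KV-cache blocks, but the scheduling principle is the same.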

GPU-Level Optimization

Improve GPU efficiency by optimizing:

  • Memory bandwidth utilization

  • Tensor Core utilization

  • Kernel launch efficiency

Work involving:

  • CUDA

  • Triton kernels

  • FlashAttention

  • Custom CUDA kernels

Distributed Inference

Design and implement:

  • Tensor parallelism

  • Pipeline parallelism

  • Expert parallelism (MoE)

  • Multi-node inference

Using:

  • NCCL

  • CUDA

  • RDMA
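To illustrate the tensor-parallelism item above: a linear layer's weight matrix is split along its output dimension across devices, each shard computes its slice locally, and the slices are concatenated (the gather that NCCL would perform over NVLink/RDMA). A toy single-process sketch in pure Python, with plain lists standing in for device tensors:

```python
def matvec(W, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sharded_matvec(W, x, num_shards):
    """Toy tensor parallelism (illustrative sketch only).

    The weight's output rows are split across `num_shards` "devices";
    each shard computes its slice independently, and the results are
    concatenated, mimicking the all-gather a real multi-GPU setup
    would issue through NCCL.
    """
    rows_per_shard = len(W) // num_shards
    out = []
    for s in range(num_shards):
        shard = W[s * rows_per_shard:(s + 1) * rows_per_shard]
        out.extend(matvec(shard, x))   # local compute, then "gather"
    return out

W = [[1, 0], [0, 1], [2, 2], [3, -1]]
x = [4, 5]
assert sharded_matvec(W, x, num_shards=2) == matvec(W, x)
```

The sharded result is bitwise identical to the unsharded one here; in real systems the engineering work lies in overlapping the communication with compute.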

Large-Scale Inference Platform

Build large-scale inference platforms including:

  • Inference scheduler

  • Load balancer

  • Multi-tenant inference system

Supporting:

  • Thousands of GPUs

  • Billions of tokens per day

Cost Optimization

Reduce cost per token through:

  • Advanced batching strategies

  • GPU memory optimization

  • Cluster scheduling
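The cost-per-token arithmetic behind this goal is simple: GPU-hour price divided by tokens served per hour. A short sketch with hypothetical illustration numbers (not vendor quotes):

```python
def cost_per_million_tokens(gpu_usd_per_hour, tokens_per_second):
    """USD cost per one million generated tokens on a single GPU.

    Inputs are hypothetical illustration numbers, not real prices.
    """
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000

# Doubling throughput (e.g. via better batching) halves cost per token.
base = cost_per_million_tokens(2.0, 1000)       # $2/hr GPU, 1k tok/s
optimized = cost_per_million_tokens(2.0, 2000)  # same GPU, 2k tok/s
```

This is why batching and memory optimizations translate directly into cost: every extra token per second on the same hardware lowers the denominator.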

Technical Requirements

Strong experience or project exposure in several of the following areas:

GPU & Low-Level Optimization

  • CUDA / CUDA Kernel development

  • GPU performance tuning

  • Kernel / Operator optimization

  • Triton / TVM

  • TensorRT acceleration

Large Model & Inference

  • Megatron-LM

  • DeepSpeed

  • Colossal-AI

  • vLLM / SGLang

  • Large model inference optimization

  • Quantization / KV cache optimization (a plus)

Distributed & Systems

  • Distributed Systems

  • PyTorch Distributed

  • NCCL

  • HPC (High Performance Computing)

  • AI Infrastructure / ML Infra

  • Multi-GPU / Multi-node training systems

Preferred Background

  • Bachelor's degree from a strong university (CS/EE/AI); Master's degree preferred

  • Strong foundation in:

    • Computer Systems

    • Operating Systems

    • Parallel Computing

    • Distributed Systems

    • Linear Algebra & ML fundamentals

  • Competitive programming / ACM / research experience is a plus

  • Publications or open-source contributions are a plus

More Info

Job ID: 143350857