Senior LLM Inference Engineer, Performance & GPU Optimization

Fresher
  • Posted 13 hours ago

Job Description

Own the performance of large language models in production — the latency, the throughput, the cost-per-token. This is deep inference-optimization work: profiling and tuning at the GPU and serving-engine level to make models run faster and cheaper at scale. You'll join a small, senior team at an established enterprise software company building LLM-powered capabilities into its products.

What you'll do:

  • Optimize LLM inference for latency, throughput, and cost — at the kernel and serving-engine level
  • Profile and tune GPU performance (CUDA, TensorRT-LLM); apply quantization, speculative decoding, and batching strategies
  • Get the most out of serving frameworks like vLLM, SGLang, and Triton — and extend them where they fall short
  • Optimize across hardware targets where relevant (NVIDIA and other accelerators)
  • Partner with model and platform teams to take new architectures from "works" to "fast"

What you'll bring:

  • Deep experience optimizing deep-learning inference in production
  • Hands-on GPU programming and performance engineering (CUDA or equivalent)
  • Fluency with modern LLM serving stacks (vLLM / TensorRT-LLM / SGLang / Triton)
  • A track record of measurable performance wins (latency / throughput / cost)
  • Strong systems fundamentals and a profiling-first mindset

Nice to have:

  • Kernel-level contributions to open-source inference projects
  • Experience across multiple accelerator types
  • Distributed / multi-GPU serving experience

A rare role where deep performance work is the whole job, not a side quest.

More Info

Job ID: 148939903