We're looking for an AI Engineer within Distributed LLM Training & Infrastructure to work on large-scale model training across distributed GPU environments. This role focuses on improving how LLMs are trained at scale, optimising performance, cost, and efficiency across multi-node systems.
The Role
- Build and optimise distributed LLM training pipelines using PyTorch
- Work with frameworks such as Megatron-LM and DeepSpeed for large-scale training
- Improve multi-node GPU performance (throughput, memory usage, NCCL communication)
- Design and run benchmarking frameworks (tokens/sec, cost, MFU (model FLOPs utilisation), latency)
- Develop standardised training recipes and playbooks for production-grade environments
What You'll Work On
- Core LLM training systems (not application-layer AI)
- Distributed systems challenges across multi-GPU, multi-node setups
- Performance optimisation and scaling of large models in production environments
Ideal Background
- Experience in distributed ML / ML systems
- Strong hands-on experience with PyTorch and multi-node, multi-GPU training
- Deep understanding of parallelism strategies (FSDP, tensor parallelism, pipeline parallelism)
- Exposure to Megatron-LM, DeepSpeed, or similar training frameworks
- Strong focus on benchmarking, optimisation, and improving training efficiency
Why This Role
- Work on high-impact problems in large-scale AI training and infrastructure
- High ownership within a lean, senior team
- Opportunity to define training standards and best practices for scalable AI systems
If you're interested, feel free to apply or reach out directly for a confidential discussion.
Only shortlisted candidates will be contacted.