Nicoll Curtin

Artificial Intelligence Engineer

5-7 Years
  • Posted 2 days ago

Job Description

We're looking for an AI Engineer in Distributed LLM Training & Infrastructure to work on large-scale model training infrastructure across distributed GPU environments. The role focuses on improving how LLMs are trained at scale: optimising performance, cost, and efficiency across multi-node systems.

The Role

  • Build and optimise distributed LLM training pipelines using PyTorch
  • Work with frameworks such as Megatron-LM and DeepSpeed for large-scale training
  • Improve multi-node GPU performance (throughput, memory usage, NCCL communication)
  • Design and run benchmarking frameworks (tokens/sec, cost, MFU, latency)
  • Develop standardised training recipes and playbooks for production-grade environments

What You'll Work On

  • Core LLM training systems (not application-layer AI)
  • Distributed systems challenges across multi-GPU, multi-node setups
  • Performance optimisation and scaling of large models in production environments

Ideal Background

  • 5-7 years of experience in distributed ML / ML systems
  • Strong hands-on experience with PyTorch and multi-node, multi-GPU training
  • Deep understanding of parallelism strategies (FSDP, tensor parallelism, pipeline parallelism)
  • Exposure to Megatron-LM, DeepSpeed, or similar training frameworks
  • Strong focus on benchmarking, optimisation, and improving training efficiency

Why This Role

  • Work on high-impact problems in large-scale AI training and infrastructure
  • High ownership within a lean, senior team
  • Opportunity to define training standards and best practices for scalable AI systems

If you're interested, feel free to apply or reach out directly for a confidential discussion.

Only shortlisted candidates will be contacted.

Job ID: 147286059
