evolution singapore

Artificial Intelligence Engineer

5-7 Years
  • Posted 21 hours ago

Job Description

About the Role

This role builds the foundation for production-grade distributed AI training at scale. You will design reusable training recipes, benchmarking frameworks, and evaluation standards that enable large customers to train and compare models efficiently across multi-node GPU clusters.

You'll work closely with platform, orchestration, and application engineers to turn distributed training best practices into repeatable, customer-facing templates.

Job Details

  • Build and maintain production-ready distributed training recipes using frameworks such as TorchTitan and Megatron-LM
  • Define model scaling baselines and tuning guidance across GPU counts, parallelism strategies, and checkpointing patterns
  • Design and run multi-node communication and performance benchmarks covering throughput, model FLOPs utilization (MFU), cost, and energy efficiency
  • Create standardized evaluation harnesses and offline benchmarking suites for model comparison
  • Publish training efficiency playbooks and benchmark results to guide internal teams and customers

Job Requirements

  • 5–7 years of hands-on experience with distributed ML training (PyTorch/JAX, FSDP, DeepSpeed, multi-node GPU systems)
  • Deep expertise in GPU performance optimization, memory behavior, and NCCL communication patterns
  • Proven ability to debug convergence issues and optimize large-scale training throughput
  • Strong benchmarking discipline with experience designing controlled, repeatable experiments
  • Practical knowledge of model parallelism trade-offs (FSDP, tensor parallelism, pipeline parallelism)

Job ID: 147388473
