Our client is a fast-growing technology company at the forefront of AI innovation, building large-scale multimodal and video intelligence products used by millions of people worldwide. They are seeking a Senior AI Infrastructure Engineer to design and scale the infrastructure powering next-generation AI models across video understanding, recommendation systems, generative AI, content moderation, and multimodal foundation models.
This is an exciting opportunity to work alongside world-class AI researchers and engineers, solving complex infrastructure challenges at scale while driving the future of video AI technologies.
Job Responsibilities
- Design, build, and maintain scalable infrastructure supporting distributed AI model training and inference.
- Optimize GPU clusters, compute resources, and system performance for large-scale AI workloads.
- Develop and enhance multimodal and video AI training pipelines to improve efficiency and scalability.
- Improve platform reliability, observability, fault tolerance, and deployment processes.
- Partner closely with AI researchers and machine learning teams to accelerate experimentation and model development cycles.
- Build infrastructure tooling for model serving, evaluation, scheduling, orchestration, and resource management.
- Design and optimize high-throughput data storage and processing systems for large-scale video datasets.
- Support the deployment and scaling of real-time AI inference services with high availability and low latency.
- Implement CI/CD pipelines and automation frameworks to streamline the AI model lifecycle.
- Evaluate and introduce emerging technologies related to distributed systems, GPU acceleration, and AI platform engineering.
- Contribute to technical architecture decisions and engineering best practices, and mentor junior team members.
Job Requirements
- Bachelor's or Master's degree in Computer Science, Software Engineering, Artificial Intelligence, or a related discipline.
- 3–5+ years of experience in infrastructure engineering, distributed systems, machine learning platforms, or related areas.
- Strong understanding of distributed computing principles and large-scale system architecture.
- Hands-on experience with Kubernetes, Docker, and cloud-native infrastructure environments.
- Experience managing GPU clusters and optimizing AI compute resources.
- Familiarity with distributed machine learning frameworks such as PyTorch Distributed, DeepSpeed, Ray, Horovod, or Megatron-LM.
- Experience working with major cloud platforms such as AWS, GCP, or Azure.
- Solid understanding of high-performance networking, storage systems, and large-scale data pipelines.
- Experience building and scaling model serving platforms and online inference systems.
- Strong troubleshooting, performance tuning, and systems optimization skills.
www.dadaconsultants.com
Licence Number: 18S9037EA
Registration Number: R23112003
Business Registration Number: 201735941W