About the Role
We are building a next-generation Enterprise AI Platform to power large-scale AI workloads across hybrid environments. We are looking for an AI Infrastructure Specialist to design, build, and scale the computing foundation that enables advanced AI use cases, including Generative AI, Vision AI, retrieval-augmented generation (RAG) pipelines, and predictive analytics.
This role sits at the intersection of cloud, GPU infrastructure, Kubernetes, and MLOps. You will own and optimize AI workloads across:
- On-premises GPU clusters
- Public cloud platforms
- Private AI cloud environments
- Distributed / edge environments
Key Responsibilities
- Design and manage hybrid AI infrastructure across on-prem and cloud environments
- Build and operate GPU-enabled Kubernetes clusters
- Optimize high-performance compute environments (NVIDIA GPUs, CUDA, NVLink)
- Enable scalable ML training and inference platforms
- Implement containerized AI environments and orchestration workflows
- Support MLOps pipelines and model lifecycle management
- Enable CI/CD pipelines for ML deployment
- Integrate model registry, artifact storage, and observability tools
- Manage vector databases and model inference endpoints
- Ensure secure, resilient, and compliant AI infrastructure
- Drive cost optimization, performance tuning, and capacity planning
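To give a flavour of the day-to-day work behind "GPU-enabled Kubernetes clusters", the sketch below shows a minimal pod spec that requests a single NVIDIA GPU. It assumes a cluster where the NVIDIA device plugin is installed (which is what exposes the `nvidia.com/gpu` resource); the pod and container names are placeholders.

```yaml
# Minimal sketch: run nvidia-smi in a CUDA base image on one GPU.
# Assumes the NVIDIA device plugin is deployed on the cluster.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test        # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda-check
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # example tag; pin to your CUDA version
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # resource exposed by the NVIDIA device plugin
```

A spec like this is a common smoke test after provisioning GPU nodes: if the pod completes and `nvidia-smi` lists the device, GPU scheduling is working end to end.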
Mandatory Requirements
- 3+ years of experience in infrastructure, cloud engineering, or platform engineering
- Strong hands-on experience with Kubernetes (production environments)
- Experience managing GPU infrastructure (NVIDIA GPUs, CUDA)
- Solid experience with at least one major cloud provider (AWS, Azure, or GCP)
- Strong knowledge of containerization (Docker)
- Experience with Infrastructure-as-Code (Terraform, Pulumi, or similar)
- Strong Linux systems and troubleshooting skills
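As an illustration of how Kubernetes, GPU infrastructure, cloud, and Infrastructure-as-Code come together in this role, here is a hedged Terraform sketch of an EKS GPU node group on AWS. It assumes an existing EKS cluster resource named `aws_eks_cluster.ai_platform` plus IAM role and subnet variables defined elsewhere; the node group name and instance type are illustrative choices, not requirements.

```hcl
# Sketch only: GPU worker node group for an assumed existing EKS cluster.
resource "aws_eks_node_group" "gpu" {
  cluster_name    = aws_eks_cluster.ai_platform.name  # assumed cluster resource
  node_group_name = "gpu-workers"                     # hypothetical name
  node_role_arn   = var.node_role_arn                 # assumed variable
  subnet_ids      = var.private_subnet_ids            # assumed variable

  instance_types = ["g5.xlarge"]    # example NVIDIA A10G GPU instance type
  ami_type       = "AL2_x86_64_GPU" # GPU-enabled EKS-optimized AMI

  scaling_config {
    desired_size = 1
    min_size     = 0   # scale to zero when no GPU workloads are running
    max_size     = 4   # cap for cost control and capacity planning
  }
}
```

The `min_size = 0` / `max_size` bounds reflect the cost-optimization and capacity-planning responsibilities above: GPU nodes are expensive, so clusters are typically sized to scale down when idle.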
EA Number: 11C4879