
As a Senior AI Infrastructure Engineer, you will be the architect and custodian of the high-performance computing (HPC) environment that powers our most ambitious AI initiatives. Your primary focus will be the end-to-end orchestration of NVIDIA-based GPU clusters, ensuring that our clients can train and deploy Large Language Models (LLMs) with zero friction, maximum throughput, and peak efficiency.
1. NVIDIA Hardware Architecture & Optimization
Cluster Design: Design and configure large-scale GPU clusters built on NVIDIA H100/A100/H200 nodes.
High-Speed Interconnects: Implement and tune NVIDIA NVLink and NVSwitch for intra-node communication, and InfiniBand/RoCE for low-latency inter-node networking.
Performance Tuning: Diagnose and resolve hardware bottlenecks (thermal throttling, PCIe lane congestion, or memory bandwidth issues) to ensure 99.9% availability for training jobs.
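To make the 99.9% availability target above concrete, a quick back-of-envelope calculation (a sketch, not an SLA formula from this posting) shows the monthly downtime budget it implies:

```python
# Rough downtime budget implied by an availability target.
# A 99.9% target over a 30-day month leaves roughly 43 minutes of downtime.
HOURS_PER_MONTH = 30 * 24  # 720 hours in a 30-day month

def downtime_budget_minutes(availability: float, hours: float = HOURS_PER_MONTH) -> float:
    """Minutes of allowable downtime per period at a given availability level."""
    return (1.0 - availability) * hours * 60

print(round(downtime_budget_minutes(0.999), 1))  # ~43.2 minutes per month
```

In practice this is why hardware-level issues like thermal throttling or PCIe congestion must be caught proactively: a single multi-hour node outage can consume the month's entire budget.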
2. Orchestration & Containerization
Kubernetes Management: Deploy and manage AI workloads using NVIDIA GPU Operator on Kubernetes to automate the management of NVIDIA software components (drivers, container runtime).
Environment Standardization: Build and maintain optimized Docker/Apptainer container images tailored for CUDA-accelerated frameworks such as PyTorch and TensorFlow.
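With the GPU Operator managing drivers and the container runtime, workloads request GPUs through the standard Kubernetes resource API. A minimal sketch (the pod name, image tag, and entrypoint are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-train                        # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.05-py3  # example NGC PyTorch image tag
    command: ["python", "train.py"]          # hypothetical training entrypoint
    resources:
      limits:
        nvidia.com/gpu: 1                    # GPU exposed by the NVIDIA device plugin
```

The `nvidia.com/gpu` resource is what the GPU Operator's device plugin advertises to the scheduler, so standard Kubernetes scheduling and quotas apply to GPU workloads.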
3. MLOps & Infrastructure as Code (IaC)
Automated Provisioning: Use Terraform or Ansible to manage the GPU fleet as code, enabling rapid scaling across hybrid-cloud and on-premise environments.
Pipeline Development: Build automated CI/CD pipelines that integrate with MLOps tools (e.g., Kubeflow, MLflow) to streamline the transition from model training to production inference.
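"GPU fleet as code" typically looks like declarative resource definitions checked into version control. A hedged Terraform sketch, assuming an AWS deployment (the variable names and AMI are hypothetical; `p5.48xlarge` is an example 8x H100 instance type):

```hcl
# Hypothetical sketch: a GPU training node pool defined as code.
resource "aws_instance" "gpu_node" {
  count         = var.cluster_size   # scale the fleet by changing one value
  ami           = var.dl_ami_id      # assumed AMI with CUDA drivers preinstalled
  instance_type = "p5.48xlarge"      # example 8x NVIDIA H100 instance type

  tags = {
    Role = "llm-training"
  }
}
```

Because the fleet is declared rather than hand-built, the same definition can be reviewed, versioned, and replayed to rebuild or scale the environment.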
4. Data Pipeline & Storage Optimization
Throughput Management: Optimize high-speed storage solutions (e.g., WEKA, GPFS) so the data pipeline never starves the GPUs, sustaining near-100% utilization during training.
Latency Reduction: Engineer low-latency inference endpoints for real-time AI applications using NVIDIA Triton Inference Server.
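Sizing storage throughput for the goal above reduces to simple arithmetic: aggregate read bandwidth must match the cluster's total data consumption rate. A sketch with illustrative numbers (none of these figures are benchmarks from this posting):

```python
# Back-of-envelope storage bandwidth needed to keep a GPU cluster fed with data.
def required_read_gbps(num_gpus: int,
                       samples_per_sec_per_gpu: float,
                       bytes_per_sample: float) -> float:
    """Aggregate storage read bandwidth (GB/s) the data pipeline must sustain."""
    return num_gpus * samples_per_sec_per_gpu * bytes_per_sample / 1e9

# e.g. 512 GPUs, each consuming 20 samples/s of 4 MB preprocessed samples
print(round(required_read_gbps(512, 20.0, 4e6), 1))  # ~41.0 GB/s
```

Numbers at this scale are why parallel filesystems like WEKA or GPFS appear in the requirements: a single NFS server cannot sustain tens of GB/s of aggregate reads.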
Requirements and Outlook
Job ID: 144616399