
AI engineer (Infrastructure)

2-5 Years
SGD 4,500 - 5,800 per month
  • Posted 4 days ago

Job Description

Role Overview

As a Senior AI Infrastructure Engineer, you will be the architect and custodian of the high-performance computing (HPC) environment that powers our most ambitious AI initiatives. Your primary focus will be the end-to-end orchestration of NVIDIA-based GPU clusters, ensuring that our clients can train and deploy Large Language Models (LLMs) with zero friction, maximum throughput, and peak efficiency.

Key Responsibilities

1. NVIDIA Hardware Architecture & Optimization

  • Cluster Design: Design and configure large-scale GPU clusters utilizing NVIDIA H100/A100/H200 nodes.

  • High-Speed Interconnects: Implement and tune NVIDIA NVLink and NVSwitch for intra-node communication, and InfiniBand/RoCE for low-latency inter-node networking.

  • Performance Tuning: Diagnose and resolve hardware bottlenecks (thermal throttling, PCIe lane congestion, or memory bandwidth issues) to ensure 99.9% availability for training jobs.
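The interconnect work described above typically starts with inspecting a node's GPU topology and steering NCCL onto the right fabric. A minimal sketch; the adapter and interface names are placeholders for a specific site:

```shell
# Inspect GPU-to-GPU connectivity (NVLink vs. PCIe hops) on a node
nvidia-smi topo -m

# Example NCCL settings for multi-node training over InfiniBand.
# HCA and interface names below are illustrative, not prescriptive.
export NCCL_DEBUG=INFO              # log NCCL's chosen transports and rings
export NCCL_IB_HCA=mlx5_0,mlx5_1    # restrict NCCL to these IB adapters
export NCCL_SOCKET_IFNAME=eth0      # interface for NCCL bootstrap traffic
```

Reading the `NCCL_DEBUG=INFO` output confirms whether collectives actually run over NVLink/InfiniBand rather than falling back to TCP.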

2. Orchestration & Containerization

  • Kubernetes Management: Deploy and manage AI workloads using NVIDIA GPU Operator on Kubernetes to automate the management of NVIDIA software components (drivers, container runtime).

  • Environment Standardization: Build and maintain optimized Docker/Apptainer containers tailored for CUDA-accelerated frameworks such as PyTorch and TensorFlow.
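With the NVIDIA GPU Operator managing drivers and the container runtime, workloads request GPUs through the extended `nvidia.com/gpu` resource. A minimal pod sketch; the pod name, image tag, and entry point are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-training-job                      # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.01-py3  # example NGC image tag
      command: ["python", "train.py"]          # hypothetical entry point
      resources:
        limits:
          nvidia.com/gpu: 8                    # request all GPUs on one node
```

The GPU Operator's device plugin handles scheduling the pod onto a node with eight free GPUs; no host-level driver setup is baked into the image.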

3. MLOps & Infrastructure as Code (IaC)

  • Automated Provisioning: Use Terraform or Ansible to treat our GPU fleet as code, enabling rapid scaling of hybrid-cloud or on-premise environments.

  • Pipeline Development: Build automated CI/CD pipelines that integrate with MLOps tools (e.g., Kubeflow, MLflow) to streamline the transition from model training to production inference.
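"Fleet as code" in practice can look like the Terraform sketch below; the cloud provider, AMI, and instance type are assumptions chosen purely for illustration:

```hcl
# Illustrative only: provisions GPU nodes in AWS via Terraform.
provider "aws" {
  region = "ap-southeast-1"
}

resource "aws_instance" "gpu_node" {
  count         = 2                        # scale the fleet by changing count
  ami           = "ami-0123456789abcdef0"  # placeholder GPU-enabled AMI
  instance_type = "p4d.24xlarge"           # 8x A100 instance class

  tags = {
    Role = "llm-training"
  }
}
```

The same pattern extends to on-premise fleets through provider plugins, or to configuration management with Ansible inventories instead of Terraform resources.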

4. Data Pipeline & Storage Optimization

  • Throughput Management: Optimize high-speed storage solutions (e.g., WEKA, GPFS) so that data delivery keeps GPUs fully utilized rather than stalling on I/O.

  • Latency Reduction: Engineer low-latency inference endpoints for real-time AI applications using NVIDIA Triton Inference Server.
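Low-latency serving with Triton is largely a model-configuration exercise. A sketch of a `config.pbtxt`; the model name, backend, tensor shapes, and batching parameters are illustrative:

```protobuf
# Illustrative Triton Inference Server model configuration (config.pbtxt)
name: "llm_classifier"            # hypothetical model name
platform: "onnxruntime_onnx"      # example backend
max_batch_size: 16
input [
  { name: "input_ids", data_type: TYPE_INT64, dims: [ 128 ] }
]
output [
  { name: "logits", data_type: TYPE_FP32, dims: [ 2 ] }
]
instance_group [
  { count: 2, kind: KIND_GPU }    # run two model instances on the GPU
]
dynamic_batching {
  max_queue_delay_microseconds: 100   # trade microseconds of queueing for larger batches
}
```

`dynamic_batching` and multiple `instance_group` copies are the usual first levers for balancing tail latency against GPU utilization.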

Requirements and Outlook

  • Background: A degree in Computer Science or Engineering, or equivalent hands-on experience in data center operations and MLOps tooling.

  • Certification: Specialized certifications, such as the NVIDIA NCP-AI Infrastructure certification, are a strong plus.


Job ID: 144616399