
As a Senior AI Infrastructure Engineer, you will be the architect and custodian of the high-performance computing (HPC) environment that powers our most ambitious AI initiatives. Your primary focus will be the end-to-end orchestration of NVIDIA-based GPU clusters, ensuring that our clients can train and deploy Large Language Models (LLMs) with zero friction, maximum throughput, and peak efficiency.
1. NVIDIA Hardware Architecture & Optimization
Cluster Design: Design and configure large-scale GPU clusters built on NVIDIA H100/A100/H200 nodes.
High-Speed Interconnects: Implement and tune NVIDIA NVLink and NVSwitch for intra-node communication, and InfiniBand/RoCE for low-latency inter-node networking.
Performance Tuning: Diagnose and resolve hardware bottlenecks (thermal throttling, PCIe lane congestion, or memory bandwidth issues) to ensure 99.9% availability for training jobs.
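To make the 99.9% availability target above concrete, a quick back-of-envelope calculation (a sketch, not an SLA formula from this posting) shows the monthly downtime budget it implies:

```python
# Rough downtime budget implied by an availability target.
# A 99.9% target over a 30-day month leaves roughly 43 minutes of downtime.
HOURS_PER_MONTH = 30 * 24  # 720 hours in a 30-day month

def downtime_budget_minutes(availability: float, hours: float = HOURS_PER_MONTH) -> float:
    """Minutes of allowable downtime per period at a given availability level."""
    return (1.0 - availability) * hours * 60

print(round(downtime_budget_minutes(0.999), 1))  # ~43.2 minutes per month
```

In practice this is why hardware-level issues like thermal throttling or PCIe congestion must be caught proactively: a single multi-hour node outage can consume the month's entire budget.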
2. Orchestration & Containerization
Kubernetes Management: Deploy and manage AI workloads using NVIDIA GPU Operator on Kubernetes to automate the management of NVIDIA software components (drivers, container runtime).
Environment Standardization: Build and maintain optimized Docker/Apptainer container images tailored for CUDA-accelerated frameworks such as PyTorch and TensorFlow.
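With the GPU Operator managing drivers and the container runtime, workloads request GPUs through the standard Kubernetes resource API. A minimal sketch (the pod name, image tag, and entrypoint are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-train                        # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.05-py3  # example NGC PyTorch image tag
    command: ["python", "train.py"]          # hypothetical training entrypoint
    resources:
      limits:
        nvidia.com/gpu: 1                    # GPU exposed by the NVIDIA device plugin
```

The `nvidia.com/gpu` resource is what the GPU Operator's device plugin advertises to the scheduler, so standard Kubernetes scheduling and quotas apply to GPU workloads.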
3. MLOps & Infrastructure as Code (IaC)
Automated Provisioning: Use Terraform or Ansible to manage the GPU fleet as code, enabling rapid scaling across hybrid-cloud and on-premise environments.
Pipeline Development: Build automated CI/CD pipelines that integrate with MLOps tools (e.g., Kubeflow, MLflow) to streamline the transition from model training to production inference.
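"GPU fleet as code" typically looks like declarative resource definitions checked into version control. A hedged Terraform sketch, assuming an AWS deployment (the variable names and AMI are hypothetical; `p5.48xlarge` is an example 8x H100 instance type):

```hcl
# Hypothetical sketch: a GPU training node pool defined as code.
resource "aws_instance" "gpu_node" {
  count         = var.cluster_size   # scale the fleet by changing one value
  ami           = var.dl_ami_id      # assumed AMI with CUDA drivers preinstalled
  instance_type = "p5.48xlarge"      # example 8x NVIDIA H100 instance type

  tags = {
    Role = "llm-training"
  }
}
```

Because the fleet is declared rather than hand-built, the same definition can be reviewed, versioned, and replayed to rebuild or scale the environment.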
4. Data Pipeline & Storage Optimization
Throughput Management: Optimize high-speed storage solutions (e.g., WEKA, GPFS) so the data pipeline never starves the GPUs, sustaining near-100% utilization during training.
Latency Reduction: Engineer low-latency inference endpoints for real-time AI applications using NVIDIA Triton Inference Server.
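Sizing storage throughput for the goal above reduces to simple arithmetic: aggregate read bandwidth must match the cluster's total data consumption rate. A sketch with illustrative numbers (none of these figures are benchmarks from this posting):

```python
# Back-of-envelope storage bandwidth needed to keep a GPU cluster fed with data.
def required_read_gbps(num_gpus: int,
                       samples_per_sec_per_gpu: float,
                       bytes_per_sample: float) -> float:
    """Aggregate storage read bandwidth (GB/s) the data pipeline must sustain."""
    return num_gpus * samples_per_sec_per_gpu * bytes_per_sample / 1e9

# e.g. 512 GPUs, each consuming 20 samples/s of 4 MB preprocessed samples
print(round(required_read_gbps(512, 20.0, 4e6), 1))  # ~41.0 GB/s
```

Numbers at this scale are why parallel filesystems like WEKA or GPFS appear in the requirements: a single NFS server cannot sustain tens of GB/s of aggregate reads.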
Requirements and Outlook
Job ID: 144616399