AI Systems & Platform Lead

15-17 Years
SGD 12,000 - 17,000 per month
  • Posted 20 days ago
  • Be among the first 10 applicants

Job Description

Key Responsibilities

  • Lead and mentor a team of system engineers responsible for delivery, operations, escalations, and technical improvement.
  • Manage and optimise OS lifecycle for GPU and CPU nodes, including patching, kernel tuning, driver and firmware updates, configuration hardening, and automation.
  • Oversee bare-metal provisioning and deployment for GPU platforms, including NVIDIA stack components such as CUDA, drivers, NCCL, and container runtimes.
  • Manage Kubernetes (k8s) clusters supporting GPU workload orchestration, including autoscaling, scheduling, node health, multi-tenant resource isolation, and capacity allocation.
  • Run and enhance container platforms (Docker/CRI-O), including image management, registry security, runtime troubleshooting, and performance optimisation.
  • Integrate and operate monitoring and telemetry systems, such as DCGM, Prometheus, node exporters, Weka telemetry, and alert pipelines.
  • Drive continuous improvement in GPU utilisation efficiency, benchmarking, platform stability, and cost/performance optimisation.
  • Own operational workflows including incident, problem, and change management, RCA execution, and improvement tracking.
  • Lead capacity planning across compute, GPU, network, and storage layers to support scale-up and customer growth.
  • Maintain complete system documentation including SOPs, runbooks, KB articles, architecture diagrams, configuration standards, and platform records.
  • Oversee the ticketing lifecycle across internal operations, customer interfaces, and vendor escalations, including RMA tracking and replacement management.
  • Meet SLA commitments and support customers through accurate troubleshooting and triage across GPU, Kubernetes, and OS environments.
  • Support ISO27001 and SOC2 compliance through configuration standards, access controls, logging, vulnerability remediation, and platform security practices.
  • Maintain audit readiness and evidence collection for operational and security compliance.
  • Collaborate with vendors, partners, and engineering teams to resolve systemic GPU, container, or orchestration issues.
  • Support budgeting and forecasting related to GPU expansion, licensing, storage growth, and platform evolution.

Skills and Experience

  • Bachelor's degree in Computer Science, Engineering, or a related discipline.
  • 15+ years of experience in solution architecture, cloud engineering, HPC, or AI infrastructure.
  • Deep hands-on experience with Linux systems, GPU platforms, Kubernetes orchestration, and container runtimes.
  • Strong technical knowledge across drivers, firmware, OS tuning, and performance benchmarking.
  • Practical experience supporting large-scale GPU clusters or HPC environments.
  • Practical experience with monitoring and telemetry platforms such as DCGM, Prometheus, Grafana, and Weka.
  • Good understanding of platform automation and infrastructure-as-code tooling (e.g., Ansible, Terraform).
  • Strong knowledge of troubleshooting processes across complex stack layers (OS, container, GPU, network, storage).
  • Excellent communication skills to work effectively across technical and non-technical stakeholders.
  • Strong documentation discipline and ability to translate technical concepts into clear written content.
  • Knowledge of ticketing platforms and RMA management processes in large-scale compute environments.
  • Excellent diagramming abilities.
  • Self-driven, analytical, and detail-oriented.

More Info

Job ID: 144945403