AI Cluster Technology Director

PaleBlueDot AI

Singapore

10-12 Years

Posted 14 hours ago

Job Description

Overseas AI Cluster Operations & Delivery Lead

PaleBlueDot AI is looking for an experienced leader to drive the design, deployment, and operations of large-scale overseas AI compute clusters.

This role will take ownership of end-to-end delivery for 10,000-GPU-class AI infrastructure, covering architecture planning, deployment execution, operational readiness, customer technical engagement, and long-term performance optimization.

If you have deep experience in AI infrastructure, HPC, hyperscale data centers, and cross-cultural team leadership, this is an opportunity to help build mission-critical AI compute platforms at global scale.

What You'll Do

Lead the overall architecture design of overseas AI clusters across compute, networking, storage, and liquid cooling
Translate customer AI workload requirements into scalable, reliable, and high-performance technical solutions
Own the full deployment lifecycle, including planning, procurement, delivery acceptance, rack installation, cabling, commissioning, and production go-live
Drive execution across data center facilities teams, hardware vendors, liquid cooling partners, and other stakeholders
Build and lead a high-performing local operations team overseas, and establish its 24/7 operations processes, SOPs, and incident response mechanisms
Establish monitoring, alerting, logging, and performance analysis systems to ensure full-stack observability and operational stability
Act as the senior technical escalation point for complex, high-impact incidents, leading troubleshooting, RCA, and long-term improvements
Work directly with customer technical teams on solution presentations, PoC support, and deep technical discussions
Ensure delivered services meet or exceed SLA expectations
Continuously optimize cluster efficiency, with a focus on PUE, WUE, compute utilization, and operating cost

What We're Looking For

Bachelor's degree or above in Computer Science, Electronic Engineering, or a related field
10+ years of experience in large-scale data center, HPC, or AI cluster operations and management
Proven experience delivering advanced overseas AI compute clusters or hyperscale data center projects
Deep expertise in AI cluster architecture, including environments such as NVIDIA DGX, SuperPOD, or GPU-as-a-Service
Strong understanding of InfiniBand, RoCE, distributed storage, and large-scale infrastructure design
Hands-on experience with immersion cooling and/or cold plate liquid cooling deployments or operations
Strong working knowledge of Linux, Slurm, Kubernetes, Prometheus, Grafana, Ansible, and Python
5+ years of technical team leadership experience, with the ability to lead and grow teams in cross-cultural environments
Excellent communication skills and strong customer-facing capability
Fluent English is strongly preferred

Why Join PaleBlueDot AI