Overseas AI Cluster Operations & Delivery Lead
PaleBlueDot AI is looking for an experienced leader to drive the design, deployment, and operations of large-scale overseas AI compute clusters.
This role owns end-to-end delivery of 10,000-GPU-class AI infrastructure, covering architecture planning, deployment execution, operational readiness, customer technical engagement, and long-term performance optimization.
If you have deep experience in AI infrastructure, HPC, hyperscale data centers, and cross-cultural team leadership, this is an opportunity to help build mission-critical AI compute platforms at global scale.
What You'll Do
- Lead the overall architecture design of overseas AI clusters across compute, networking, storage, and liquid cooling
- Translate customer AI workload requirements into scalable, reliable, and high-performance technical solutions
- Own the full deployment lifecycle, including planning, procurement, delivery acceptance, rack installation, cabling, commissioning, and production go-live
- Drive execution across data center facilities teams, hardware vendors, liquid cooling partners, and other stakeholders
- Build and lead a high-performing local operations team overseas, and establish its 24/7 operations processes, SOPs, and incident response mechanisms
- Establish monitoring, alerting, logging, and performance analysis systems to ensure full-stack observability and operational stability
- Act as the senior technical escalation point for complex, high-impact incidents, leading troubleshooting, root cause analysis (RCA), and long-term improvements
- Work directly with customer technical teams on solution presentations, PoC support, and deep technical discussions
- Ensure delivered services meet or exceed SLA expectations
- Continuously optimize cluster efficiency, with a focus on PUE, WUE, compute utilization, and operating cost
What We're Looking For
- Bachelor's degree or above in Computer Science, Electronic Engineering, or a related field
- 10+ years of experience in large-scale data center, HPC, or AI cluster operations and management
- Proven experience delivering advanced overseas AI compute clusters or hyperscale data center projects
- Deep expertise in AI cluster architecture, including environments such as NVIDIA DGX, SuperPOD, or GPU-as-a-Service
- Strong understanding of InfiniBand, RoCE, distributed storage, and large-scale infrastructure design
- Hands-on experience deploying or operating immersion cooling and/or cold-plate liquid cooling
- Strong working knowledge of Linux, Slurm, Kubernetes, Prometheus, Grafana, Ansible, and Python
- 5+ years of technical team leadership experience, with the ability to lead and grow teams in cross-cultural environments
- Excellent communication skills and strong customer-facing capability
- Fluent English is strongly preferred
Why Join PaleBlueDot AI
- Build cutting-edge AI infrastructure at global scale
- Work on high-impact, mission-critical AI compute environments
- Join a fast-moving team at the forefront of AI infrastructure innovation