PaleBlueDot AI

AI Cluster Technology Director

10-12 Years
  • Posted 14 hours ago

Job Description

Overseas AI Cluster Operations & Delivery Lead

PaleBlueDot AI is looking for an experienced leader to drive the design, deployment, and operations of large-scale overseas AI compute clusters.

This role will take ownership of end-to-end delivery for 10,000-GPU-class AI infrastructure, covering architecture planning, deployment execution, operational readiness, customer technical engagement, and long-term performance optimization.

If you have deep experience in AI infrastructure, HPC, hyperscale data centers, and cross-cultural team leadership, this is an opportunity to help build mission-critical AI compute platforms at global scale.

What You'll Do

  • Lead the overall architecture design of overseas AI clusters across compute, networking, storage, and liquid cooling
  • Translate customer AI workload requirements into scalable, reliable, and high-performance technical solutions
  • Own the full deployment lifecycle, including planning, procurement, delivery acceptance, rack installation, cabling, commissioning, and production go-live
  • Drive execution across data center facilities teams, hardware vendors, liquid cooling partners, and other stakeholders
  • Build and lead a high-performing local operations team overseas, and establish its 24/7 operations processes, SOPs, and incident response mechanisms
  • Establish monitoring, alerting, logging, and performance analysis systems to ensure full-stack observability and operational stability
  • Act as the senior technical escalation point for complex, high-impact incidents, leading troubleshooting, root cause analysis (RCA), and long-term improvements
  • Work directly with customer technical teams on solution presentations, proof-of-concept (PoC) support, and deep technical discussions
  • Ensure delivered services meet or exceed SLA expectations
  • Continuously optimize cluster efficiency, with a focus on power usage effectiveness (PUE), water usage effectiveness (WUE), compute utilization, and operating cost

What We're Looking For

  • Bachelor's degree or above in Computer Science, Electronic Engineering, or a related field
  • 10+ years of experience in large-scale data center, HPC, or AI cluster operations and management
  • Proven experience delivering advanced overseas AI compute clusters or hyperscale data center projects
  • Deep expertise in AI cluster architecture, including environments such as NVIDIA DGX, SuperPOD, or GPU-as-a-Service
  • Strong understanding of InfiniBand, RoCE, distributed storage, and large-scale infrastructure design
  • Hands-on experience with immersion cooling and/or cold plate liquid cooling deployments or operations
  • Strong working knowledge of Linux, Slurm, Kubernetes, Prometheus, Grafana, Ansible, and Python
  • 5+ years of technical team leadership experience, with the ability to lead and grow teams in cross-cultural environments
  • Excellent communication skills and strong customer-facing capability
  • Fluent English is strongly preferred

Why Join PaleBlueDot AI

  • Build cutting-edge AI infrastructure at global scale
  • Work on high-impact, mission-critical AI compute environments
  • Join a fast-moving team at the forefront of AI infrastructure innovation

Job ID: 146998347