- Global exposure and opportunities to work on cross-border projects
- High leadership visibility and impact on business outcomes
About Our Client
A Singapore-headquartered global leader renowned for innovative solutions, robust infrastructure, and driving digital transformation.
Job Description
- Serve as the overall lead and the point of accountability for end-to-end GPUaaS and data centre operations, including operational reporting.
- Oversee day-to-day platform and facility operations across GPU hardware, networking, environmental systems, security controls, and supporting software.
- Lead and coordinate internal operations teams, vendors, and consultants during routine activities as well as critical incidents.
- Partner with engineering and external stakeholders to deliver platform upgrades and data centre improvement initiatives.
- Develop, review, and refine operational processes to maintain platform stability across compute, power, cooling, and infrastructure components.
- Take charge of major incidents, drive root cause analysis, and ensure clear, timely updates to customers and stakeholders.
- Provide regular updates to management on operational performance, risks, and improvement plans.
- Ensure incidents are triaged and escalated appropriately based on severity, business impact, and SLA/SLO commitments.
- Build, lead, and motivate a strong operations team with a focus on accountability and continuous improvement.
- Set clear performance expectations, coach team members, and support ongoing professional development.
- Oversee security incident management and uphold security and compliance standards within the GPUaaS environment.
- Stay current with industry security developments and implement safeguards to protect customer workloads and platform integrity.
- Support scheduled maintenance activities and participate in on-call duties when required.
The Successful Applicant
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- At least 8 years of experience in data centre operations, with a minimum of 3 years in a leadership capacity.
- Solid understanding of data centre infrastructure, including servers, networking, storage, and both physical security and cybersecurity controls.
- Practical experience with electrical and mechanical systems, facilities management, and preventive maintenance practices.
- Demonstrated ability to lead teams and manage vendors effectively.
- Strong organisational skills with the ability to adapt to evolving operational demands.
- Hands-on experience with Linux and hypervisor administration in GPU or GPUaaS environments.
- Strong analytical and troubleshooting skills, with a proactive approach to performance optimisation and system reliability.
- Working knowledge of storage technologies, including capacity planning, troubleshooting, and data protection strategies.
- Experience managing GPU infrastructure, including configuration, monitoring, and performance tuning.
- Familiarity with liquid cooling technologies used in high-density GPU environments.
- Understanding of GPU cluster architectures and AI/HPC environments, including collective communication libraries (e.g. NCCL), high-performance networking (e.g. InfiniBand, RDMA), and containerised or orchestrated platforms supporting AI and HPC workloads.
What's on Offer
Our client is a growing firm with a tight-knit team. The successful candidate will have the chance to contribute to a high-performing team while having the autonomy to make certain decisions for the team.
Contact
Winson Low (Lic No: R22106039 / EA No: 18C9065)
Quote job ref
JN-032026-6959635
Phone number
+65 6416 9865