Lead and mentor a team of system engineers responsible for delivery, operations, escalations, and technical improvement.
Manage and optimise OS lifecycle for GPU and CPU nodes, including patching, kernel tuning, driver and firmware updates, configuration hardening, and automation.
Oversee bare-metal provisioning and deployment for GPU platforms, including NVIDIA stack components such as CUDA, drivers, NCCL, and container runtimes.
Manage Kubernetes (k8s) clusters supporting GPU workload orchestration, including autoscaling, scheduling, node health, multi-tenant resource isolation, and capacity allocation.
Run and enhance container platforms (Docker/CRI-O), including image management, registry security, runtime troubleshooting, and performance optimisation.
Integrate and operate monitoring and telemetry systems, such as DCGM, Prometheus, node exporters, Weka telemetry, and alert pipelines.
Drive continuous improvement in GPU utilisation efficiency, benchmarking, platform stability, and cost/performance optimisation.
Own operational workflows including incident, problem, and change management, RCA execution, and improvement tracking.
Lead capacity planning across compute, GPU, network, and storage layers to support scale-up and customer growth.
Maintain complete system documentation including SOPs, runbooks, KB articles, architecture diagrams, configuration standards, and platform records.
Oversee the ticketing lifecycle across internal operations, customer interfaces, and vendor escalations, including RMA tracking and replacement management.
Ensure strong SLA adherence and effective customer communication through accurate troubleshooting and triage across GPU, Kubernetes, and OS environments.
Support ISO 27001 and SOC 2 compliance through configuration standards, access controls, logging, vulnerability remediation, and platform security practices.
Maintain audit readiness and evidence collection for operational and security compliance.
Collaborate with vendors, partners, and engineering teams to resolve systemic GPU, container, or orchestration issues.
Support budgeting and forecasting related to GPU expansion, licensing, storage growth, and platform evolution.
Skills and Experience
Bachelor's degree in Computer Science, Engineering, or a related discipline.
15+ years of experience in solution architecture, cloud engineering, HPC, or AI infrastructure.
Deep hands-on experience with Linux systems, GPU platforms, Kubernetes orchestration, and container runtimes.
Strong technical knowledge across drivers, firmware, OS tuning, and performance benchmarking.
Practical experience supporting large-scale GPU clusters or HPC environments.
Practical experience with monitoring and telemetry platforms such as DCGM, Prometheus, Grafana, and Weka.
Good understanding of platform automation and infrastructure-as-code tooling (e.g., Ansible, Terraform).
Strong knowledge of troubleshooting processes across complex stack layers (OS, container, GPU, network, storage).
Excellent communication skills, with the ability to work effectively with technical and non-technical stakeholders.
Strong documentation discipline and ability to translate technical concepts into clear written content.
Knowledge of ticketing platforms and RMA management processes in large-scale compute environments.
Excellent diagramming skills for architecture and workflow documentation.