Participate in the full lifecycle of HPC cluster operations, including system bring-up and bring-down, workload characterisation and optimisation, and rollout of new AI and software services.
Design and operate a GPU orchestration layer with high availability and utilisation for AI training, inference and other scientific workloads.
Partner with other DSO engineers to design standards, automate operations, and translate research code into performant workloads on distributed systems.
Maintain hardware infrastructure, distributed storage, high-speed networking and supporting IT infrastructure, and support their maintenance and upgrades.
Job Requirements
Degree in Computer Science & Engineering / Software Engineering / Artificial Intelligence or any other related field
Minimum 2 years of experience in IT infrastructure or a related field. More experienced candidates may be considered for a senior role.
Strong proficiency in Linux environments, computer architecture, and Python / Bash scripting for tooling and automation.
Working proficiency in Kubernetes container orchestration and infrastructure provisioning/management software (e.g. Ansible, Terraform) for fleet automation.
Experience with NVIDIA GPUs, GitOps, Infra CI/CD, networking protocols, and other AI infrastructure technologies will be advantageous.
Strong written and verbal communication skills to lead vendor and cross-functional engagements and/or performance analysis and troubleshooting initiatives.