System Engineer (HPC)
Location: Singapore
Employment Type: Contract (1 year renewable)
Certification Required:
- ITIL Foundation or equivalent or higher
- Red Hat Certified System Administrator (RHCSA) or equivalent or higher
Role Purpose
The client is seeking an experienced HPC Systems Engineer (or Senior HPC Systems Engineer, depending on experience) to support and operate large-scale Linux-based high-performance computing (HPC), storage, and networking environments. The role supports research scientists, academic users, and enterprise workloads, ensuring reliable, secure, and high-performance HPC operations.
Key Accountabilities
HPC Systems Operations:
- Administer, operate, and maintain Linux-based HPC clusters, including compute, storage, and high-speed networking.
- Manage and support:
  - HPC job schedulers (e.g. Slurm, PBS Pro, LSF)
  - Parallel file systems (Lustre, GPFS / Spectrum Scale, BeeGFS)
  - Data Lake solutions (e.g. VAST)
  - Hierarchical Storage Management (HSM) (e.g. Data Management Framework, DMF)
  - Cluster management and provisioning tools
- Perform system monitoring, patching, upgrades, and capacity planning.
- Troubleshoot and resolve hardware, software, OS, and network issues across HPC environments.
- Participate in on-call or escalation support rotations as required.
- Work with software engineers to support AI/DL applications and with desktop engineers to assist users as needed.
- Provide advice and guidance to researchers on HPC application development, debugging, optimization, and parallelization.
- Deliver HPC user training sessions and contribute to documentation and best-practice guides.
Key Performance Indicators
- Meet all SLA requirements for incident and service request handling.
- Comply with all policy and contract requirements.
Attributes:
- Strong analytical and troubleshooting skills
- Highly motivated and self-driven
- Collaborative team player
- Excellent written and verbal communication skills
- Ability to explain complex technical topics to non-technical users
- Commitment to continuous learning and knowledge sharing