High-Performance Computing (HPC) Engineer - Consultant

kuberox technologies

Singapore

5-10 Years

Posted 2 hours ago

Job Description

Job Summary

We are seeking a skilled High-Performance Computing (HPC) Engineer with 5–10 years of experience to design, deploy, manage, and optimize HPC cluster environments. The ideal candidate will have hands-on experience with cluster scheduling, monitoring, performance tuning, and supporting scientific or engineering workloads in Linux-based environments.

Key Responsibilities

Design, deploy, and maintain HPC cluster infrastructure to ensure high availability and performance.
Manage and configure job scheduling systems such as PBS and SLURM.
Implement and maintain monitoring solutions using Grafana, Nagios, Prometheus, and Ganglia.
Administer cluster management and configuration tools, including Bright Cluster Manager, xCAT, and Puppet, to automate infrastructure.
Configure and troubleshoot high-speed networking technologies including InfiniBand and Gigabit Ethernet.
Perform system performance analysis, profiling, and debugging using tools like Intel VTune, Valgrind, and gprof.
Provide application support for scientific and engineering workloads using GNU, Intel, and NVIDIA CUDA compilers, as well as the Intel MKL library.
Manage Proxmox virtualization environments and administer license managers such as FlexLM.
Configure and maintain storage solutions including parallel file systems and enterprise object storage platforms.
Ensure system security, patching, and compliance in Red Hat Linux environments.
Collaborate with research, engineering, and IT teams to optimize workloads and resource utilization.
Document system architecture, processes, and troubleshooting guides.

Required Skills & Qualifications

5–10 years of experience in HPC systems administration or engineering.
Strong experience with job schedulers such as PBS and SLURM.
Hands-on experience with monitoring tools: Grafana, Nagios, Prometheus, Ganglia.
Expertise in cluster management and configuration tools such as Bright Cluster Manager, xCAT, and Puppet.
Solid understanding of HPC networking, including InfiniBand and Ethernet.
Experience with performance profiling and debugging tools (Intel VTune, Valgrind, gprof).
Familiarity with compilers and libraries: GNU, Intel, NVIDIA CUDA, Intel MKL.
Experience with virtualization platforms like Proxmox and license management (FlexLM).
Knowledge of storage technologies: parallel file systems (e.g., Lustre, GPFS) and object storage.
Strong Linux administration skills, specifically Red Hat Enterprise Linux.
Scripting skills (Bash, Python, or similar) for automation and troubleshooting.