Job Summary
We are seeking a skilled High-Performance Computing (HPC) Engineer with 5–10 years of experience to design, deploy, manage, and optimize HPC cluster environments. The ideal candidate will have hands-on experience with cluster scheduling, monitoring, performance tuning, and supporting scientific or engineering workloads in Linux-based environments.
Key Responsibilities
- Design, deploy, and maintain HPC cluster infrastructure to ensure high availability and performance.
- Manage and configure job scheduling systems such as PBS and SLURM.
- Implement and maintain monitoring solutions using Grafana, Nagios, Prometheus, and Ganglia.
- Administer cluster management tools such as Bright Cluster Manager and xCAT, and use Puppet for infrastructure automation.
- Configure and troubleshoot high-speed networking technologies including InfiniBand and Gigabit Ethernet.
- Perform system performance analysis, profiling, and debugging using tools like Intel VTune, Valgrind, and gprof.
- Provide application support for scientific and engineering workloads using GNU, Intel, and CUDA compilers, as well as the Intel MKL math libraries.
- Manage Proxmox virtualization environments and administer license servers such as FlexLM.
- Configure and maintain storage solutions including parallel file systems and enterprise object storage platforms.
- Ensure system security, patching, and compliance in Red Hat Linux environments.
- Collaborate with research, engineering, and IT teams to optimize workloads and resource utilization.
- Document system architecture, processes, and troubleshooting guides.
Required Skills & Qualifications
- 5–10 years of experience in HPC systems administration or engineering.
- Strong experience with job schedulers such as PBS and SLURM.
- Hands-on experience with monitoring tools: Grafana, Nagios, Prometheus, Ganglia.
- Expertise in cluster management and configuration tools like Bright Cluster Manager, xCAT, and Puppet.
- Solid understanding of HPC networking, including InfiniBand and Ethernet.
- Experience with performance profiling and debugging tools (Intel VTune, Valgrind, gprof).
- Familiarity with compilers and libraries: GNU, Intel, CUDA, MKL.
- Experience with virtualization platforms like Proxmox and license management (FlexLM).
- Knowledge of storage technologies: parallel file systems (e.g., Lustre, GPFS) and object storage.
- Strong Linux administration skills, specifically Red Hat Enterprise Linux.
- Scripting skills (Bash, Python, or similar) for automation and troubleshooting.