HPC System Engineer

5-8 Years
SGD 9,000 - 18,000 per month
  • Posted 4 days ago

Job Description

Key Responsibilities

  • Design and develop compute cluster architectures optimized for performance, reliability, scalability, and serviceability within KLA systems.
  • Define and validate server hardware configurations, including CPUs, GPUs, memory subsystems, storage, networking, and specialized accelerators.
  • Analyze and optimize system-level performance across hardware and software layers, including CPU/GPU utilization, memory bandwidth, PCIe topology, NUMA architecture, and I/O performance.
  • Collaborate with hardware, software, firmware, and systems engineering teams to ensure seamless integration of compute clusters into broader system architectures.
  • Support server bring-up, hardware integration, diagnostics, benchmarking, stress testing, and root-cause analysis activities.
  • Manage and troubleshoot enterprise server platforms, including BIOS/firmware configuration, BMC/IPMI management, thermal and power optimization, and hardware health monitoring.
  • Participate in architecture reviews, integration planning, technical discussions, and cross-functional problem-solving sessions.
  • Create and maintain technical documentation for hardware design decisions, validation procedures, deployment standards, and troubleshooting workflows.

Required Skills & Qualifications

  • Strong experience in computer hardware and system architecture design, particularly in compute clusters, HPC environments, or enterprise server platforms.
  • Deep understanding of modern CPU and GPU architectures, including multicore processing, NUMA, PCIe, memory hierarchy, and hardware-software interactions.
  • Experience with GPU-accelerated systems and accelerator integration (e.g., NVIDIA GPU platforms, CUDA environments, or similar technologies).
  • Hands-on experience with Linux system administration and OS customization (preferably SUSE Linux Enterprise Server).
  • Familiarity with enterprise server management technologies such as BIOS/UEFI, BMC, IPMI, iDRAC, or similar remote management tools.
  • Understanding of distributed systems, high-performance networking, and cluster infrastructure technologies such as InfiniBand, RDMA, or high-speed Ethernet.
  • Experience with system performance tuning, hardware validation, benchmarking, and low-level troubleshooting.
  • Strong analytical, documentation, and communication skills.

Preferred Qualifications

  • Experience in high-performance computing (HPC), AI/ML infrastructure, or large-scale distributed compute environments.
  • Familiarity with server hardware bring-up, failure analysis, thermal/power optimization, and reliability engineering.
  • Exposure to hardware diagnostic and monitoring tools for server and cluster environments.
  • Understanding of storage architectures, parallel file systems, and distributed storage solutions.
  • Experience working in cross-functional engineering teams across hardware, firmware, and software domains.
  • Test-driven and detail-oriented engineering mindset with strong problem-solving skills.
  • Self-motivated individual with a proactive approach to continuous improvement and technical innovation.

Job ID: 147921069
