Senior Platform Engineer

FIRMUS METAL INTERNATIONAL PTE. LTD.

Shenton Way, Singapore

7-9 Years

SGD 10,000 - 16,000 per month

Job Description

ROLES AND RESPONSIBILITIES

Firmus Technologies is seeking a Senior Platform Engineer to join our Engineering and Technology team. You will drive the design and implementation of our MLOps capability. You will also collaborate with other engineers and make technical decisions on scaling the Firmus AI factory's platform engineering capabilities to planet scale, spanning IaC, container orchestration, observability, the self-service portal, and platform security. This role is ideal for a self-starter with a passion for building things from first principles. You naturally break down complex problems into their fundamental truths to uncover novel and elegant solutions, rather than relying on conventional patterns.

KEY RESPONSIBILITIES

Build MLOps capabilities from the ground up, enabling reproducible, scalable, and secure ML workflows across internal and customer-facing environments.
Continuously improve our DevOps platform to ensure reliability, scalability, security, and seamless integration with CI/CD pipelines and infrastructure services.
Design, implement, operate, and secure Kubernetes-based production infrastructure for high reliability and performance, including clusters supporting NVIDIA GB300 NVL72 systems with NVIDIA Quantum-X800 InfiniBand or Spectrum-X Ethernet.
Develop world-class observability platforms for internal and external customers to achieve Platinum tier recognition from SemiAnalysis.
Integrate Firmus central services with NVIDIA's software stack, including Mission Control, NetQ, UFM, and NMX.
Lead the enhancement and evangelism of internal platform products that provide cohesive, composable, secure-by-default, and low-friction self-service experiences that accelerate time to market and reduce engineers' cognitive load.
Drive incident response efforts, participate actively in the on-call rotation, and lead detailed Root Cause Analysis (RCA) to continuously improve system reliability, operational maturity, and incident handling processes.

SKILLS AND EXPERIENCE

Bachelor's degree in computer science or a related technical field.
7+ years of experience as a Platform Engineer, Site Reliability Engineer, DevOps Engineer, MLOps Engineer, or Observability Engineer.
Demonstrated strong proficiency: Infrastructure-as-Code, configuration management and CI/CD (e.g., Terraform, Ansible, GitHub Actions, Jenkins, ArgoCD).
Demonstrated strong proficiency: Containerization technologies (e.g., Docker), Kubernetes networking and cluster management, including upgrades and troubleshooting.
Demonstrated strong proficiency: Observability stack design and scaling (e.g., Loki, Grafana, Tempo, Prometheus, Thanos, ClickHouse).
Demonstrated strong proficiency: Telemetry solutions using various technologies (e.g., Redfish, gNMI, SNMP, eBPF, streaming analytics).
Demonstrated strong proficiency: Unified telemetry collection with OpenTelemetry.
Demonstrated strong proficiency: Compliance automation (e.g., OPA, Kyverno).
Demonstrated strong proficiency: Scripting and programming (e.g., Bash, Python, Go).
Demonstrated strong proficiency: Systems knowledge of Linux internals, networking stacks, and distributed storage.
Clear and effective English communication, written and spoken.
Bonus Points: Experience in high-growth startups or regulated industries with robust security and data privacy requirements, including SOC 2 Type 2 and ISO 27001.

At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.

Join us in our mission to revolutionize the AI industry through sustainable practices and cutting-edge engineering. Apply now to be part of shaping the future of sustainable AI infrastructure.