Senior Site Reliability Engineer

Nicoll Curtin

Singapore

10-12 Years

Job Description

Senior / Lead Site Reliability Engineer (Infrastructure & Storage)

We are hiring a Senior / Lead SRE to drive reliability and scalability for large-scale, globally distributed infrastructure platforms. This role sits at the intersection of systems engineering, storage platforms, and production reliability, supporting critical services used across multiple regions.

You will work closely with engineering counterparts across international locations, operating in a fast-paced, high-availability environment where performance, automation, and resilience at scale are key.

Key Responsibilities

Lead the design and evolution of high-scale distributed systems, ensuring reliability, efficiency, and fault tolerance
Own and optimise storage infrastructure, with a strong focus on HDFS and distributed data platforms
Drive end-to-end production reliability, including system performance tuning, capacity planning, and failure mitigation
Partner with global engineering teams to improve system resilience, observability, and deployment pipelines
Define and implement automation-first approaches to reduce manual operations and improve system efficiency
Lead incident response, root cause analysis, and postmortem processes in a fast-moving production environment
Participate in on-call rotations, ensuring high service availability and rapid issue resolution
Provide technical leadership, setting best practices for SRE, infrastructure, and operational excellence
Influence architecture decisions to support long-term scalability and growth

Requirements

Extensive experience (10+ years) in SRE, infrastructure engineering, or large-scale distributed systems
Strong track record in a technical leadership or staff/lead engineer role
Deep hands-on expertise with HDFS and large-scale storage systems
Strong understanding of distributed systems, fault tolerance, and high-availability design
Experience operating internet-scale or high-throughput systems in production environments
Proficiency in automation, scripting, and infrastructure tooling (e.g., Python, Go, Shell)
Familiarity with Linux systems, networking fundamentals, and performance tuning
Experience with cloud-native or hybrid infrastructure environments
Comfortable working in a globally distributed team across time zones
Willingness to be part of an on-call rotation

Nice to Have

Strong background in storage engineering (candidates with deep storage expertise and less direct SRE experience are welcome)
Experience with big data ecosystems (e.g., Hadoop, Spark)
Exposure to containerisation and orchestration technologies (e.g., Kubernetes)
Mandarin language skills to support cross-regional collaboration