Senior / Lead Site Reliability Engineer (Infrastructure & Storage)
We are hiring a Senior / Lead SRE to drive reliability and scalability for large-scale, globally distributed infrastructure platforms. This role sits at the intersection of systems engineering, storage platforms, and production reliability, supporting critical services used across multiple regions.
You will work closely with engineering counterparts across international locations, operating in a fast-paced, high-availability environment where performance, automation, and resilience at scale are key.
Key Responsibilities
- Lead the design and evolution of high-scale distributed systems, ensuring reliability, efficiency, and fault tolerance
- Own and optimise storage infrastructure, with a strong focus on HDFS and distributed data platforms
- Drive end-to-end production reliability, including system performance tuning, capacity planning, and failure mitigation
- Partner with global engineering teams to improve system resilience, observability, and deployment pipelines
- Define and implement automation-first approaches to reduce manual operations and improve system efficiency
- Lead incident response, root cause analysis, and postmortem processes in a fast-moving production environment
- Participate in on-call rotations, ensuring high service availability and rapid issue resolution
- Provide technical leadership, setting best practices for SRE, infrastructure, and operational excellence
- Influence architecture decisions to support long-term scalability and growth
Requirements
- Extensive experience (10+ years) in SRE, infrastructure engineering, or large-scale distributed systems
- Strong track record in a technical leadership or staff/lead engineer role
- Deep hands-on expertise with HDFS and large-scale storage systems
- Strong understanding of distributed systems, fault tolerance, and high-availability design
- Experience operating internet-scale or high-throughput systems in production environments
- Proficiency in automation, scripting, and infrastructure tooling (e.g., Python, Go, Shell)
- Familiarity with Linux systems, networking fundamentals, and performance tuning
- Experience with cloud-native or hybrid infrastructure environments
- Ability to work effectively in a globally distributed team across time zones
- Willingness to be part of an on-call rotation
Nice to Have
- Strong background in storage engineering (candidates with deep storage expertise but less direct SRE experience are welcome)
- Experience with big data ecosystems (e.g., Hadoop, Spark)
- Exposure to containerisation and orchestration technologies (e.g., Kubernetes)
- Mandarin language skills to support cross-regional collaboration