Senior Site Reliability Engineer

LANDI International (Singapore) Pte. Ltd.

Singapore, Beach Road

7-9 Years

SGD 8,000 - 11,000 per month

Posted 2 hours ago

Job Description

As a Senior Site Reliability Engineer at LANDI Global, you will play a critical role in defining and advancing the reliability, scalability, and performance of our platform infrastructure. You will work closely with cross-functional teams to establish reliability standards, drive automation strategy, and lead continuous improvement initiatives across our environments.

Infrastructure & Platform Operations

Design, build, and optimize LANDI Global's platform infrastructure across development, staging, and production environments, with a focus on scalability and resilience.
Collaborate with R&D and platform teams to define architecture patterns and reliability standards that ensure availability and operational excellence.
Lead platform readiness for new client onboarding, ensuring scalability, repeatability, and operational sustainability.

Monitoring, Reliability & Incident Management

Define and drive improvements in monitoring, logging, and alerting systems to ensure high signal quality and proactive issue detection.
Lead incident response for high-severity events and drive high-quality root cause analysis (RCA) with a focus on systemic improvements.
Design, evolve, and validate Disaster Recovery (DR) and business continuity strategies, ensuring systems meet recovery objectives.
Participate in and help evolve the 24/7 standby model to improve operational effectiveness and sustainability.

Performance, Optimization & Automation

Analyze platform performance metrics and lead optimization strategies across cloud and on-prem environments.
Drive improvements in automated testing, CI/CD pipelines, and deployment workflows to enhance release safety, speed, and reliability.
Identify and eliminate operational toil through automation and engineering solutions.
Establish and standardize operational runbooks and procedures across services.

Operational Support

Provide advanced troubleshooting and support for complex production issues, guiding teams toward effective resolution.
Lead continuous improvement initiatives to enhance platform resilience, scalability, and operational efficiency.
Act as a key escalation point for critical platform issues and reliability concerns.

Technical Leadership & Collaboration

Mentor Associate SREs and SREs through guidance, reviews, and knowledge sharing.
Influence engineering teams without direct authority to adopt best practices in reliability and operations.
Act as a bridge between SRE, platform, and R&D teams to align on scalable and sustainable engineering practices.

Requirements & Qualifications

Bachelor's degree in Computer Science, Software Engineering, or a related field.
Minimum 7 years of experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role.
Strong verbal and written communication skills in English and Mandarin.
Strong experience designing and operating distributed systems at scale.
Proven ability to improve reliability across multiple services or platforms.
Deep understanding of system failure modes, scalability, and performance trade-offs.
Experience defining and implementing SLOs, SLIs, and observability practices.
Ability to lead incident response and drive systemic improvements.
Strong communication skills with the ability to influence without authority.