As a Senior Site Reliability Engineer at LANDI Global, you will play a critical role in defining and advancing the reliability, scalability, and performance of our platform infrastructures. You will work closely with cross-functional teams to establish reliability standards, drive automation strategy, and lead continuous improvement initiatives across our environments.
Infrastructure & Platform Operations
- Design, build, and optimize LANDI Global's platform infrastructures across development, staging, and production environments, with a focus on scalability and resilience.
- Collaborate with R&D and platform teams to define architecture patterns and reliability standards that ensure availability and operational excellence.
- Lead platform readiness for new client onboarding, ensuring scalability, repeatability, and operational sustainability.
Monitoring, Reliability & Incident Management
- Define and drive improvements in monitoring, logging, and alerting systems to ensure high signal quality and proactive issue detection.
- Lead incident response for high-severity events and drive high-quality root cause analysis (RCA) with a focus on systemic improvements.
- Design, evolve, and validate Disaster Recovery (DR) and business continuity strategies, ensuring systems meet recovery objectives.
- Participate in and help evolve the 24/7 standby model to improve operational effectiveness and sustainability.
Performance, Optimization & Automation
- Analyze platform performance metrics and lead optimization strategies across cloud and on-prem environments.
- Drive improvements in automated testing, CI/CD pipelines, and deployment workflows to enhance release safety, speed, and reliability.
- Identify and eliminate operational toil through automation and engineering solutions.
- Establish and standardize operational runbooks and procedures across services.
Operational Support
- Provide advanced troubleshooting and support for complex production issues, guiding teams toward effective resolution.
- Lead continuous improvement initiatives to enhance platform resilience, scalability, and operational efficiency.
- Act as a key escalation point for critical platform issues and reliability concerns.
Technical Leadership & Collaboration
- Mentor Associate SREs and SREs through guidance, reviews, and knowledge sharing.
- Influence engineering teams without direct authority to adopt best practices in reliability and operations.
- Act as a bridge between SRE, platform, and R&D teams to align on scalable and sustainable engineering practices.
REQUIREMENTS & QUALIFICATIONS
- Bachelor's degree in Computer Science, Software Engineering, or a related field.
- Minimum 7 years of experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role.
- Strong verbal and written communication skills in English and Mandarin.
- Strong experience designing and operating distributed systems at scale.
- Proven ability to improve reliability across multiple services or platforms.
- Deep understanding of system failure modes, scalability, and performance trade-offs.
- Experience defining and implementing SLOs, SLIs, and observability practices.
- Ability to lead incident response and drive systemic improvements.
- Strong communication skills with the ability to influence without authority.