(Weekend and shift work may be necessary during the employment period)
As a Site Reliability Engineer, you will fill a mission-critical role, ensuring that our systems are healthy, monitored, automated, fault-tolerant and designed to scale.
You will work closely with engineering teams to continually improve our production services, facilitate fast delivery of new products and reduce downtime.
Key Responsibilities:
- Drive the Site Reliability Engineering agenda to improve the availability, reliability and performance of services
- Drive observability for our applications
- Drive the optimise-operate initiative, for example the reduction of operational toil
- Work with application teams to set up SLIs, SLOs and error budgets for their applications
- Work with the enterprise team to deploy SRE enablers and initiatives
- Any other ad-hoc duties as assigned by supervisors
Requirements:
- Degree in IT or related discipline
- Good understanding of ITIL and SRE processes and practices
- Good leadership skills in working with application teams and service providers to define infrastructure deployment plans, cutover/migration strategies and test plans
- Able to formulate and establish infrastructure deployment standards
- Good people management, vendor management and project management skills
- Agile, AWS certification preferred
- Able to create Bash/Python scripts for infrastructure deployment
- Must be able to practise SRE and Chaos Engineering principles
- Understands key SRE concepts such as toil, SLIs, SLOs, error budgets, MTTD and MTTR
Interested applicants, please email your resume to Karin Chan Wei Kien
Email:
CEI Reg No: R1104584
Recruit Express Pte Ltd
UEN: 199601303W
EA Licence No: 99C4599