Description and Requirements

Job Summary

The Site Reliability Engineer (SRE) is responsible for ensuring the reliability, availability, and performance of enterprise systems in a managed services (Day 2) environment.

The role operates as a centralised reliability function across application, infrastructure, and vendor support layers, governing operational activities so that incident response, changes, and patching are carried out without impacting service stability or SLA commitments.

The SRE works closely with Application Engineers (L1.5) and Application Vendors (L2), providing oversight, risk control, and engineering-driven improvements to maintain a stable and resilient production environment.

Key Responsibilities

1. Reliability & Service Assurance

Own end-to-end service reliability, including availability, performance, and system stability
Define and track reliability metrics (e.g., uptime, latency, error rates)
Ensure SLA compliance through proactive monitoring and operational governance
Establish service health indicators and early warning mechanisms

2. Monitoring & Observability

Design and implement monitoring, logging, and alerting frameworks across application and infrastructure layers
Define alert thresholds and reduce alert noise to improve signal quality
Develop dashboards and reporting for real-time visibility of system health
Continuously enhance observability coverage across services

3. Incident Management & RCA

Lead major incident management (P1/P2) as incident commander
Perform end-to-end root cause analysis (RCA) across application, infrastructure, and vendor domains
Coordinate with Application Engineers and Vendors for issue resolution
Drive preventive and corrective actions to reduce incident recurrence

4. Change & Patch Governance

Assess operational risks associated with changes, releases, and patching activities
Work with Application Engineers (L1.5) and Vendors (L2) to ensure safe execution of application patches
Perform pre- and post-change validation to ensure system stability
Govern Go/No-Go decisions and support rollback planning in case of service degradation

5. Performance & Capacity Management

Monitor and optimise system performance across application and infrastructure layers
Conduct capacity planning and forecasting to ensure scalability and resilience
Identify and address performance bottlenecks proactively

6. Automation & Continuous Improvement

Drive automation of operational processes, including monitoring, recovery, and validation
Implement self-healing and resilience mechanisms where applicable
Develop and maintain operational runbooks and automation scripts
Continuously improve system reliability through engineering practices

7. Collaboration & Governance

Work closely with Application Engineers (L1.5) on the execution of operational activities
Collaborate with Vendors (L2) for defect resolution and product-level fixes
Ensure compliance with governance, security, and audit requirements
Support service reviews, reporting, and continuous improvement initiatives

Requirements

Core Requirements

Experience in Site Reliability Engineering, DevOps, or production operations in enterprise environments
Strong understanding of cloud platforms (preferably AWS)
Experience with monitoring and observability tools
Strong troubleshooting capability across application and infrastructure layers
Experience in incident management and root cause analysis
Familiarity with ITIL processes (incident, problem, change management)

Preferred

Experience in a system integrator or managed services (Day 2 operations) environment
Exposure to enterprise applications (e.g., IWMS platforms such as Archibus or similar)
Experience with automation and scripting (Python, PowerShell, etc.)
Knowledge of performance tuning and capacity planning

Key Competencies