Role Summary
The Site Reliability Engineer (SRE) ensures the reliability, availability, and performance of systems and platform services through a balance of engineering and operational excellence. The SRE applies software engineering principles to operations, using automation, monitoring, and data-driven analysis to improve reliability while enabling development velocity.
In the current structure, SREs operate as both reliability owners and domain practitioners, supporting platform and product engineering teams across SRE and DevOps responsibilities. They are guided by a Senior Principal SRE, who provides organizational alignment, establishes common standards, and ensures consistency across teams.
- Own end-to-end system reliability, availability, and performance using clearly defined SLAs, SLOs, and SLIs, with continuous monitoring and proactive improvement of service health.
- Establish and govern error budget policies in partnership with engineering leadership to balance release velocity with reliability, using error budgets to inform prioritization and release readiness decisions.
- Lead major and complex incident response efforts, collaborate during customer-impacting events, and drive blameless postmortems to ensure systemic corrective actions are implemented with urgency.
- Standardize and enhance observability across environments through robust monitoring, logging, and tracing frameworks using tools such as Dynatrace, CloudWatch, and OpenTelemetry.
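
As an illustration of how an error budget can inform a release readiness decision, the following is a minimal sketch rather than a prescribed implementation. The class name, field names, and the freeze threshold are hypothetical and chosen only for this example; the actual policy would be agreed with engineering leadership.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""
    slo_target: float      # e.g. 0.999 for a 99.9% availability SLO
    total_requests: int    # requests observed in the SLO window
    failed_requests: int   # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0.0 or below means it is spent.
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


def release_allowed(budget: ErrorBudget, freeze_threshold: float = 0.0) -> bool:
    # Assumed policy: releases proceed only while budget remains above the threshold.
    return budget.remaining_fraction > freeze_threshold


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"remaining budget: {budget.remaining_fraction:.2%}")
    print(f"release allowed: {release_allowed(budget)}")
```

In this sketch, a service with a 99.9% SLO and 1,000,000 requests may fail about 1,000 of them in the window, so 250 failures leave roughly three quarters of the budget and releases continue. Once failures exceed the allowance, the same check returns False and reliability work takes priority.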
Technical Skills (3-5 years of relevant experience)
- Advanced knowledge of core AWS services: EC2, ECS/EKS, Lambda, S3, RDS/Aurora, DynamoDB, VPC, ELB/ALB/NLB, Route 53, IAM.
- Experience designing highly available multi-AZ and multi-region architectures.
- Strong understanding of networking in AWS (subnets, routing tables, NAT, security groups, NACLs, VPC peering, PrivateLink).
- Experience with the AWS Well-Architected Framework pillars (especially reliability, security, and cost optimization).
- Designing fault-tolerant and horizontally scalable systems.
- Advanced proficiency in Terraform, CloudFormation, or CDK.
- Hands-on experience with CloudWatch, Prometheus, Grafana, Datadog, Dynatrace, or OpenTelemetry.
- Modular IaC design patterns and state management best practices.