Role Summary
The Site Reliability Engineer (SRE) ensures the reliability, availability, and performance of systems and platform services through a balance of engineering and operational excellence. The SRE applies software engineering principles to operations, using automation, monitoring, and data-driven analysis to improve reliability while enabling development velocity.
Technical Skills
- Advanced knowledge of core AWS services: EC2, ECS/EKS, Lambda, S3, RDS/Aurora, DynamoDB, VPC, ELB/ALB/NLB, Route 53, IAM.
- Experience designing highly available multi-AZ and multi-region architectures.
- Strong understanding of networking in AWS (subnets, routing tables, NAT, security groups, NACLs, VPC peering, PrivateLink).
- Experience with the AWS Well-Architected Framework pillars (especially reliability, security, and cost optimization).
- Experience designing fault-tolerant and horizontally scalable systems.
- Advanced proficiency in Terraform, CloudFormation, or CDK.
- Hands-on experience with CloudWatch, Prometheus, Grafana, Datadog, Dynatrace, or OpenTelemetry.
- Familiarity with modular IaC design patterns and state management best practices.