EXASOFT PTE. LTD.

Site Reliability Engineer

3-5 Years
SGD 5,700 - 5,800 per month
  • Posted 13 days ago

Job Description

Role Summary
The Site Reliability Engineer (SRE) ensures the reliability, availability, and performance of systems and platform services through a balance of engineering and operational excellence. The SRE applies software engineering principles to operations, using automation, monitoring, and data-driven analysis to improve reliability while enabling development velocity.

In the current structure, SREs operate as both reliability owners and domain practitioners, supporting platform and product engineering teams across SRE and DevOps responsibilities. They are guided by a Senior Principal SRE, who provides organizational alignment, establishes common standards, and ensures consistency across teams.

  • Own end-to-end system reliability, availability, and performance using clearly defined SLAs, SLOs, and SLIs, with continuous monitoring and proactive improvement of service health.
  • Establish and govern error budget policies in partnership with engineering leadership to balance release velocity with reliability, using error budgets to inform prioritization and release readiness decisions.
  • Lead major and complex incident response efforts, collaborate during customer-impacting events, and drive blameless postmortems to ensure systemic corrective actions are implemented with urgency.
  • Standardize and enhance observability across environments through robust monitoring, logging, and tracing frameworks using tools such as Dynatrace, CloudWatch, and OpenTelemetry.

Technical Skills (3-5 years of relevant experience)

  • Advanced knowledge of core AWS services: EC2, ECS/EKS, Lambda, S3, RDS/Aurora, DynamoDB, VPC, ELB/ALB/NLB, Route53, IAM.
  • Designing multi-AZ and multi-region highly available architectures.
  • Strong understanding of networking in AWS (subnets, routing tables, NAT, security groups, NACLs, VPC peering, PrivateLink).
  • Experience with the AWS Well-Architected Framework pillars (especially reliability, security, and cost optimization).
  • Designing fault-tolerant and horizontally scalable systems.
  • Advanced proficiency in Terraform, CloudFormation, or CDK.
  • Hands-on experience with CloudWatch, Prometheus, Grafana, Datadog, Dynatrace, or OpenTelemetry.
  • Modular IaC design patterns and state management best practices.

Job ID: 142884035
