A leading global gaming and technology company is seeking a highly capable Site Reliability Engineer (SRE) to join their team in Singapore. This is a mission-critical role where you'll own the reliability, scalability, and performance of complex distributed systems supporting a global platform. You'll work at the intersection of software development and operations: designing robust systems, responding to live incidents, and driving automation across infrastructure and CI/CD processes.
The Position
- Monitor production systems using tools like Grafana and New Relic to detect performance issues and security vulnerabilities.
- Respond to live incidents and outages, perform root cause analysis, and drive postmortem documentation and learning.
- Maintain up-to-date operational runbooks for common issues and workflows.
- Collaborate closely with developers to streamline production releases, patches, and deployment workflows.
- Manage infrastructure across cloud environments (primarily AWS), and optimize CI/CD pipelines for reliability and efficiency.
- Handle capacity planning and system performance tuning, and implement infrastructure-as-code using tools like Terraform.
The Candidate
- Comes from a backend or full-stack development background and is comfortable coding in languages such as Java, JavaScript/TypeScript, or Bash.
- Has experience running services at scale in cloud environments like AWS, with a strong understanding of Linux.
- Thinks like a software engineer, but with the mindset of an operator, proactively preventing outages and continuously improving systems.
- Is adept at debugging under pressure, analyzing logs/metrics, and communicating clearly during incidents.
- Is passionate about automation, observability, and creating self-healing systems.
Preferred Qualifications
- 3+ years of experience in site reliability engineering, DevOps, or software engineering roles.
- Proven skills in:
  - Monitoring & alerting tools (Grafana, New Relic)
  - CI/CD pipelines (Git, Jenkins, GitHub Actions, etc.)
  - Container orchestration (Docker, Kubernetes)
  - Infrastructure-as-code (Terraform, CloudFormation, Ansible)
  - Managing and securing AWS environments
- Understanding of authentication/authorization protocols (OAuth, JWT, OpenID)
- Familiarity with SQL/NoSQL databases (PostgreSQL, Redis, MongoDB)
- Strong interpersonal skills and a collaborative approach to working with cross-functional teams.
We regret to inform you that only shortlisted candidates will be contacted.
EA Registration No: R22105541, TAY ZHIHENG, DARIUS
Allegis Group Singapore Pte Ltd, Company Reg No. 200909448N, EA License No. 10C4544