Site Reliability Engineer

TEKsystems

Posted 11 days ago

3-5 Years

Singapore

Job Description

A leading global gaming and technology company is seeking a highly capable Site Reliability Engineer (SRE) to join their team in Singapore. This is a mission-critical role where you'll own the reliability, scalability, and performance of complex distributed systems supporting a global platform. You'll work at the intersection of software development and operations: designing robust systems, responding to live incidents, and driving automation across infrastructure and CI/CD processes.

The Position

Monitor production systems using tools like Grafana and New Relic to detect performance issues and security vulnerabilities.
Respond to live incidents and outages, perform root cause analysis, and drive postmortem documentation and learning.
Maintain up-to-date operational runbooks for common issues and workflows.
Collaborate closely with developers to streamline production releases, patches, and deployment workflows.
Manage infrastructure across cloud environments (primarily AWS), and optimize CI/CD pipelines for reliability and efficiency.
Handle capacity planning, system performance tuning, and implement infrastructure-as-code using tools like Terraform.

The Candidate

Comes from a backend or full-stack development background and is comfortable coding in languages such as Java, JavaScript/TypeScript, or Bash.
Has experience running services at scale in cloud environments like AWS, with a strong understanding of Linux.
Thinks like a software engineer, but with the mindset of an operator: proactively preventing outages and continuously improving systems.
Is adept at debugging under pressure, analyzing logs/metrics, and communicating clearly during incidents.
Is passionate about automation, observability, and creating self-healing systems.

Preferred Qualifications

3+ years of experience in site reliability engineering, DevOps, or software engineering roles.
Proven skills in:
  Monitoring & alerting tools (Grafana, New Relic)
  CI/CD pipelines (Git, Jenkins, GitHub Actions, etc.)
  Container orchestration (Docker, Kubernetes)
  Infrastructure-as-code (Terraform, CloudFormation, Ansible)
  Managing and securing AWS environments
Understanding of authentication/authorization protocols (OAuth, JWT, OpenID)
Familiarity with SQL/NoSQL databases (PostgreSQL, Redis, MongoDB)
Strong interpersonal skills and a collaborative approach to working with cross-functional teams.

We regret to inform you that only shortlisted candidates will be contacted.

EA Registration No: R22105541, TAY ZHIHENG, DARIUS

Allegis Group Singapore Pte Ltd, Company Reg No. 200909448N, EA License No. 10C4544