Site Reliability Engineer

Allegis Group Singapore Pte Ltd

Singapore

7-10 Years

SGD 10,000 - 15,000 per month

This job is no longer accepting applications

Posted 3 months ago

Job Description

OVERVIEW

We're hiring a Site Reliability Engineer to support a key global technology client. You'll join a modern, cloud‑native engineering environment and partner closely with development teams to improve the reliability, scalability, and automation of distributed platforms. The role blends software engineering with reliability ownership: you'll design and build internal services and tooling, streamline CI/CD, implement Infrastructure‑as‑Code at scale, and strengthen observability so issues are found and fixed before they impact users.

This position offers high autonomy and visibility. You'll work across well‑documented systems and established tooling, prepare proofs of concept to influence change, and drive pragmatic automation (in Go or Python) that reduces manual effort and makes releases safer and faster. If you enjoy hands‑on engineering, diagnosing complex problems, and landing improvements in real production environments, this is an opportunity to make a clear and measurable impact.

DESCRIPTION

As a Site Reliability Engineer, you will:

Build internal platforms, services, and APIs that enable self‑service provisioning, safe deployments, and efficient day‑to‑day operations.
Enhance CI/CD workflows (e.g., Jenkins or similar) to increase deployment reliability, add guardrails, and improve developer experience and velocity.
Implement and evolve Infrastructure‑as‑Code using Terraform (and related patterns) to standardize environments, reduce configuration drift, and improve repeatability.
Define and operationalize SLIs/SLOs and error budgets, build actionable dashboards, and tune alerts to reflect user experience and business risk.
Operate Kubernetes workloads at scale; improve resilience, performance, and cost‑efficiency through sound engineering and automation.
Strengthen observability (metrics, logs, traces) using Prometheus and complementary platforms; drive root‑cause analysis and preventative fixes.
Automate routine work and periodic upgrade cycles (preferably in Go/Python) to eliminate toil and reduce change risk.
Troubleshoot complex incidents across compute, networking, containers, and deployments; participate in a shared on‑call rotation and contribute to post‑incident reviews.
Collaborate with engineers, architects, and product stakeholders to translate requirements into secure, observable, and scalable infrastructure solutions.
Document patterns and best practices; mentor teams on reliability‑first ways of working and platform standards.

QUALIFICATIONS

Strong hands‑on experience with AWS (production environments) and cloud‑native architectures; familiarity with hybrid or multi‑cloud concepts is a plus.
Practical expertise operating Kubernetes (deployments, day‑2 operations, and troubleshooting).
Solid CI/CD skills with Jenkins or similar tools (pipeline design, release safety, rollbacks).
Proficiency in Infrastructure‑as‑Code (Terraform) and Git‑based workflows for environment management.
Programming/automation in Go and/or Python (production‑quality code; tooling and services, not just scripts).
Observability experience with Prometheus and dashboards/alerting tuned to SLIs/SLOs; familiarity with platforms such as Grafana, Datadog, or CloudWatch is welcome.
Networking fundamentals for distributed systems: DNS, load balancing, VPC design, security groups, and layer‑7 routing/proxies.
Sound understanding of secure system design (least privilege, secrets management, change control) and performance/reliability trade‑offs.
Excellent communication skills and the ability to operate independently in distributed, asynchronous teams while influencing stakeholders through clear proposals and POCs.
7+ years in SRE/DevOps/Infrastructure/Software Engineering with a track record of operating production‑grade systems at scale.

PROFESSIONAL ATTRIBUTES

Ownership: You're accountable across both build and run; you close the loop with measurable outcomes.
Automation first: You remove toil with durable solutions, not quick fixes.
Engineering rigor: You apply design patterns, testing, and code reviews to platform work.
Influence without authority: You use documentation, POCs, and calm communication to align teams.
Proactive and visible: You work independently across time zones and keep stakeholders informed.

We regret to inform you that only shortlisted candidates will be contacted.

EA Registration No: R21103843, Andrew Jonas Matthew

Allegis Group Singapore Pte Ltd, Company Reg No. 200909448N, EA License No. 10C4544

More Info

Job Type: Contract Job

Role: Other Roles

Function: Others

About Company

Allegis Group Singapore Pte Ltd

Allegis Group is the global leader in talent solutions focused on working harder and caring more than any other provider. We'll go further to understand the needs of our people (our clients, our candidates, and our employees) and to consistently deliver on our promise of an unsurpassed quality experience. That's the Allegis Group difference, and it's consistent across every Allegis Group company. With more than US$11 billion in annual revenues and over 500 locations across the globe, our network provides businesses with a comprehensive suite of talent solutions without sacrificing the niche expertise required to ensure a successful partnership. Our specialised group of companies includes: Aerotek, TEKsystems, Allegis Global Solutions, Aston Carter, Major, Lindsey & Africa, Allegis Partners, MarketSource, and EASi. Visit www.AllegisGroup.com to learn more.

Job ID: 143730709
