Senior DevOps / Site Reliability Engineer (SRE)

trulyyy

Singapore

5-7 Years

Posted 18 hours ago

Job Description

About the Opportunity

We are partnering with a few fast-growing technology organizations to hire experienced DevOps and Site Reliability Engineers. The role focuses on building scalable cloud infrastructure, improving system reliability, enhancing observability, and driving operational excellence across business-critical platforms.

Key Responsibilities

Reliability Engineering

Design, build and maintain highly available, scalable, and resilient production systems.
Define and implement service reliability standards, including SLIs, SLOs and operational best practices.
Lead incident response, root cause analysis, post-incident reviews, and reliability improvement initiatives.
Drive capacity planning, performance optimization, disaster recovery, and business continuity planning.

Cloud Infrastructure & Platform Engineering

Build and manage cloud-native infrastructure across AWS, Azure, GCP or hybrid environments.
Implement Infrastructure-as-Code (IaC) using tools such as Terraform, Ansible or Helm.
Design and maintain Kubernetes-based platforms and containerized workloads.
Improve platform scalability, security and operational efficiency.

Observability & Monitoring

Build and maintain enterprise monitoring, logging and alerting platforms.
Develop dashboards, metrics, alerting standards and operational visibility across services.
Support observability technologies such as Prometheus, Grafana, Datadog, ELK / OpenSearch, CloudWatch and Sentry.

Automation & DevOps

Design and maintain CI/CD pipelines to support rapid and reliable software delivery.
Automate operational processes, deployment workflows and infrastructure management.
Improve engineering productivity through tooling, standardization and self-service platforms.

Performance & Scalability

Conduct performance testing, load testing and stress testing.
Identify system bottlenecks and implement optimization strategies.
Support high-volume distributed systems and microservices architectures.

Security & Operational Governance

Partner with security teams to implement secure infrastructure practices.
Support access management, secrets management, vulnerability remediation and compliance initiatives.
Promote operational excellence and reliability best practices across engineering teams.

Requirements

5+ years of experience in DevOps, Site Reliability Engineering, Platform Engineering or Infrastructure Engineering.
Strong hands-on experience supporting production systems in cloud environments.
Experience with Kubernetes, Docker and container orchestration technologies.
Strong Linux administration and troubleshooting skills.
Experience with at least one major cloud platform: AWS, Azure or GCP.
Proficiency in one or more programming or scripting languages: Python, Go, Bash or JavaScript.
Experience designing and maintaining CI/CD pipelines.
Strong problem-solving, debugging and root cause analysis capabilities.

Preferred Qualifications

Experience with large-scale distributed systems and microservices architectures.
Experience implementing observability and monitoring platforms.
Hands-on experience with Terraform, Ansible, Helm or other automation tools.
Experience supporting high-concurrency, high-availability systems.
Familiarity with technologies such as Kafka, Redis, Elasticsearch, MongoDB or similar distributed platforms.
Experience in internet, fintech, SaaS, cloud platform, gaming or technology-driven environments.
Experience collaborating with regional or globally distributed engineering teams.

TRULYYY PTE. LTD.

Senior Consultant

Yang Suyu

EA License No: 20S0118

EA Registration Number: R2199541