Site Reliability Engineer (SRE)

trulyyy pte. ltd.

Singapore, Paya Lebar

5-7 Years

SGD 8,000 - 16,000 per month

Posted 25 days ago

Job Description

We are partnering with a fast-growing regional technology platform to hire an experienced Site Reliability Engineer (SRE) to support large-scale, high-availability internet systems across Southeast Asia.

This role will focus on system reliability, incident response, infrastructure optimization, and operational excellence in a high-traffic production environment.

Responsibilities

Ensure the stability, reliability, and performance of business-critical systems and applications
Manage application deployment, configuration changes, monitoring, capacity planning, and operational maintenance
Perform root cause analysis and troubleshooting for production incidents and critical system failures
Drive system reliability improvements including high availability, fault tolerance, disaster recovery, rate limiting, and service degradation mechanisms
Optimize system performance and critical service links through performance analysis and architecture improvements
Develop and maintain operational SOPs, incident response procedures, and disaster recovery plans
Establish and track SLO metrics and follow up on reliability improvement initiatives
Build and improve operational tooling, automation, and platformization to enhance operational efficiency and security
Collaborate closely with engineering and business teams to ensure smooth delivery and stable operations
Provide IT and infrastructure support for network-related issues in the Singapore office, including LAN, Wi-Fi, and connectivity troubleshooting with cloud vendors

Requirements

Minimum 5 years of experience in Site Reliability Engineering / DevOps / Infrastructure Operations within internet or technology companies
Strong troubleshooting and incident management experience in large-scale production environments
Familiar with JVM memory management and GC mechanisms, with the ability to troubleshoot Java process-related issues
Hands-on experience with middleware and distributed systems such as Nginx, ZooKeeper, Kafka, RocketMQ, Redis, Memcached, and Twemproxy
Familiar with monitoring and observability tools such as Grafana, Prometheus, and Zabbix
Experience supporting high-concurrency, high-availability, and microservices architecture environments
Proficiency in at least one scripting or programming language, such as Python, Shell, Go, or Java
Experience in capacity planning, service governance, and end-to-end system reliability management is highly preferred
Familiarity with SRE operational frameworks and best practices is advantageous
Basic networking knowledge, with the ability to troubleshoot office and cloud networking issues
Strong analytical thinking, communication skills, and ability to work effectively under pressure
Ability to communicate effectively in Mandarin to support coordination with Mandarin-speaking stakeholders and regional technical teams
TRULYYY PTE. LTD.
Senior Consultant
Yang Suyu
EA License No: 20S0118
EA Registration Number: R2199541