Site Reliability Engineer (SRE)

5-7 Years
SGD 8,000 - 16,000 per month
  • Posted 25 days ago

Job Description

We are partnering with a fast-growing regional technology platform to hire an experienced Site Reliability Engineer (SRE) to support large-scale, high-availability internet systems across Southeast Asia.

This role will focus on system reliability, incident response, infrastructure optimization, and operational excellence in a high-traffic production environment.

Responsibilities

  • Ensure the stability, reliability, and performance of business-critical systems and applications

  • Manage application deployment, configuration changes, monitoring, capacity planning, and operational maintenance

  • Perform root cause analysis and troubleshooting for production incidents and critical system failures

  • Drive system reliability improvements including high availability, fault tolerance, disaster recovery, rate limiting, and service degradation mechanisms

  • Optimize system performance and critical service links through performance analysis and architecture improvements

  • Develop and maintain operational SOPs, incident response procedures, and disaster recovery plans

  • Establish and track SLO metrics and follow up on reliability improvement initiatives

  • Build and improve operational tooling, automation, and platformization to enhance operational efficiency and security

  • Collaborate closely with engineering and business teams to ensure smooth delivery and stable operations

  • Provide IT and infrastructure troubleshooting support for network issues at the Singapore office, including LAN, Wi-Fi, and connectivity troubleshooting with cloud vendors

Requirements

  • Minimum 5 years of experience in Site Reliability Engineering / DevOps / Infrastructure Operations within internet or technology companies

  • Strong troubleshooting and incident management experience in large-scale production environments

  • Familiarity with JVM memory management and GC mechanisms, with the ability to troubleshoot Java process-related issues

  • Hands-on experience with middleware and distributed systems including Nginx, ZooKeeper, Kafka, RocketMQ, Redis, Memcached, Twemproxy, etc.

  • Familiarity with monitoring and observability tools such as Grafana, Prometheus, Zabbix, etc.

  • Experience supporting high-concurrency, high-availability, and microservices architecture environments

  • Proficiency in at least one scripting or programming language such as Python, Shell, Go, or Java

  • Experience in capacity planning, service governance, and end-to-end system reliability management is highly preferred

  • Familiarity with SRE operational frameworks and best practices is advantageous

  • Basic networking knowledge, with the ability to troubleshoot office and cloud-related networking issues

  • Strong analytical thinking, communication skills, and ability to work effectively under pressure

  • Ability to communicate effectively in Mandarin to support coordination with Mandarin-speaking stakeholders and regional technical teams

    TRULYYY PTE. LTD.

    Senior Consultant

    Yang Suyu

    EA License No: 20S0118

    EA Registration Number: R2199541

Job ID: 147456059