Senior Site Reliability Engineer (SRE / DevOps)

5-7 Years

SGD 7,000 - 10,000 per month

Job Description

Manage day-to-day operations, deployment, monitoring, and incident response for global gaming and livestreaming systems
Collaborate with engineering, QA, and product teams to quickly diagnose and resolve production issues, ensuring high service availability and SLA compliance
Analyze system performance and optimize network quality across global regions
Oversee production database health: conduct routine inspections, manage backups and recovery, optimize slow queries, and plan for capacity
Implement and maintain monitoring and alerting systems to ensure infrastructure observability
Automate operational tasks and workflows using scripting languages such as Shell or Python
Support capacity planning, cost optimization, and disaster recovery preparedness
Participate in an on-call rotation, as needed, to support 24/7 system uptime

Requirements

5-7 years of relevant experience in DevOps, SRE, or infrastructure operations in the internet, gaming, or livestreaming industry
Strong Linux system administration and troubleshooting skills
Proficient in infrastructure scripting (Shell/Python) and automation
Solid experience with production database management (e.g., MySQL/PostgreSQL), including tuning, scaling, and disaster recovery
Familiar with global cloud infrastructure providers such as AWS and AliCloud
Experience building network observability and monitoring systems for overseas markets
Working knowledge of container technologies (e.g., Docker, Kubernetes) and CI/CD pipelines
Experience supporting 24/7 mission-critical environments or participating in on-call duty is a strong advantage