About the Opportunity
We are partnering with several fast-growing technology organizations to hire experienced DevOps and Site Reliability Engineers. This role will focus on building scalable cloud infrastructure, improving system reliability, enhancing observability, and driving operational excellence across business-critical platforms.
Key Responsibilities
Reliability Engineering
- Design, build and maintain highly available, scalable, and resilient production systems.
- Define and implement service reliability standards, including SLIs, SLOs and operational best practices.
- Lead incident response, root cause analysis, post-incident reviews, and reliability improvement initiatives.
- Drive capacity planning, performance optimization, disaster recovery, and business continuity planning.
Cloud Infrastructure & Platform Engineering
- Build and manage cloud-native infrastructure across AWS, Azure, GCP or hybrid environments.
- Implement Infrastructure-as-Code (IaC) using tools such as Terraform, Ansible or Helm.
- Design and maintain Kubernetes-based platforms and containerized workloads.
- Improve platform scalability, security and operational efficiency.
Observability & Monitoring
- Build and maintain enterprise monitoring, logging and alerting platforms.
- Develop dashboards, metrics, alerting standards and operational visibility across services.
- Support observability technologies such as:
  - Prometheus
  - Grafana
  - Datadog
  - ELK / OpenSearch
  - CloudWatch
  - Sentry
Automation & DevOps
- Design and maintain CI/CD pipelines to support rapid and reliable software delivery.
- Automate operational processes, deployment workflows and infrastructure management.
- Improve engineering productivity through tooling, standardization and self-service platforms.
Performance & Scalability
- Conduct performance testing, load testing and stress testing.
- Identify system bottlenecks and implement optimization strategies.
- Support high-volume distributed systems and microservices architectures.
Security & Operational Governance
- Partner with security teams to implement secure infrastructure practices.
- Support access management, secrets management, vulnerability remediation and compliance initiatives.
- Promote operational excellence and reliability best practices across engineering teams.
Requirements
- 5+ years of experience in DevOps, Site Reliability Engineering, Platform Engineering or Infrastructure Engineering.
- Strong hands-on experience supporting production systems in cloud environments.
- Experience with Kubernetes, Docker and container orchestration technologies.
- Strong Linux administration and troubleshooting skills.
- Experience with at least one major cloud platform:
  - AWS
  - Azure
  - GCP
- Proficiency in one or more programming or scripting languages:
  - Python
  - Go
  - Bash
  - JavaScript
- Experience designing and maintaining CI/CD pipelines.
- Strong problem-solving, debugging and root cause analysis skills.
Preferred Qualifications
- Experience with large-scale distributed systems and microservices architectures.
- Experience implementing observability and monitoring platforms.
- Hands-on experience with Terraform, Ansible, Helm or other automation tools.
- Experience supporting high-concurrency, high-availability systems.
- Familiarity with technologies such as Kafka, Redis, Elasticsearch, MongoDB or similar distributed platforms.
- Experience in internet, fintech, SaaS, cloud platform, gaming or technology-driven environments.
- Experience collaborating with regional or globally distributed engineering teams.
TRULYYY PTE. LTD.
Senior Consultant
Yang Suyu
EA License No: 20S0118
EA Registration Number: R2199541