We are a team that designs, develops, maintains, and improves software for various venture projects, i.e., projects that are adjacent to our core businesses and are bootstrapped quickly by a lean team. You will be actively involved in designing the components behind scalable applications, from frontend UI to backend infrastructure.
What you'll be doing
- Ensure the entire stack is healthy, with hardware, software, applications, and network operating at optimal performance
- Perform deep dives into both systemic and latent reliability issues, partnering with other software and DevOps engineers across the organization to design, implement, and roll out fixes
- Continuously improve availability, reliability, and observability, and reduce the burden of human toil with tooling and automation
- Lead and drive SRE initiatives to improve operational efficiency
- Represent the SRE team in system design reviews and operational readiness exercises for new and existing services
What you need
- Experience coding in Ruby and/or Go
- Familiarity with GitOps principles and tools (GitHub Actions, Docker, Kubernetes)
- Experience in designing, analyzing, and troubleshooting large-scale distributed systems
- Curiosity about finding root causes in incidents and outages
- Ability to build alignment, cultivate relationships, and drive impact
- A mindset for designing fault-tolerant system architecture
- Comfort with being uncomfortable in ambiguous situations
- Experience with incident management and response
- Desire to grow your expertise and to inform and educate others
- Ability to pick up various technologies quickly, as a fast learner with a get-things-done mentality
- Humility to embrace better ideas from others, eagerness to make things better, and openness to challenges and possibilities
Desirable
- Familiarity with cloud platforms and microservice-based architecture (AWS is a big plus)
- Familiarity with monitoring tools (e.g., Datadog, OpenTelemetry)
- Familiarity with CI/CD tools (e.g., GitHub Actions)
- Familiarity with IaC tools (e.g., Terraform, Spacelift)
- Experience in designing resilient system architecture
- Experience in optimizing the performance of large-scale production systems