Job Description
We are seeking an experienced Site Reliability Engineer (SRE) to support and enhance cloud-based applications and operational platforms. The ideal candidate will have strong experience in cloud infrastructure, CI/CD pipelines, automation, observability, reliability engineering, and production support within enterprise environments.
Key Responsibilities
- Manage and support cloud infrastructure, preferably on AWS.
- Build, maintain, and optimize CI/CD pipelines and deployment automation processes.
- Improve platform scalability, reliability, performance, and availability.
- Implement monitoring, logging, alerting, and observability solutions.
- Support DevSecOps practices, including vulnerability remediation and security compliance.
- Perform incident management, root cause analysis (RCA), and service reliability improvements.
- Collaborate with development and infrastructure teams on production deployments and operational support.
- Contribute to system architecture, integration, and technical design discussions.
- Prepare and maintain technical documentation, SOPs, and operational runbooks.
- Identify operational risks and recommend mitigation strategies.
Requirements
- Minimum 10 years of experience in DevOps, SRE, or cloud engineering roles.
- Hands-on experience with AWS cloud services and infrastructure support.
- Strong experience with CI/CD tools and automation practices.
- Experience with monitoring and observability platforms.
- Familiarity with containerization and orchestration technologies.
- Strong troubleshooting and production support experience.
- Good communication and stakeholder management skills.
Preferred / Advantageous Skills
- Exposure to Infrastructure-as-Code tools such as Terraform or CloudFormation.
- Experience with Kubernetes and container platforms.
- Knowledge of DevSecOps and cloud security best practices.
- Experience supporting high-availability or AR/VR digital platforms.
Technical Skills
- AWS Cloud Services
- Docker / Kubernetes
- Terraform / CloudFormation
- Jenkins / GitLab CI / GitHub Actions
- Linux Administration
- Prometheus / Grafana / ELK / Splunk
- Monitoring & Observability
- Incident Management & RCA
- DevSecOps Practices
- SRE Concepts (SLA / SLO / SLI)