
We are looking for a proactive and skilled Site Reliability Engineer (SRE) who has hands-on experience with the Elastic Stack (Elasticsearch, Logstash, Kibana, Beats) to join our engineering team. The ideal candidate will be responsible for ensuring the reliability, scalability, and performance of our critical systems and services, with a focus on managing and optimizing Elastic deployments. This role combines software engineering and systems engineering to build and maintain highly available infrastructure.
Design, deploy, and maintain scalable and highly available Elastic Stack environments to support logging, monitoring, and analytics needs.
Monitor system health, performance, and capacity of Elastic clusters and related infrastructure, proactively identifying and resolving issues.
Automate deployment, configuration, and management of Elastic components using infrastructure-as-code tools and scripting.
Collaborate with development, operations, and security teams to ensure reliability, security, and compliance of Elastic-based solutions.
Develop and maintain monitoring, alerting, and incident response processes to minimize downtime and improve system resilience.
Troubleshoot and resolve complex issues related to Elastic performance, indexing, query optimization, and cluster stability.
Participate in on-call rotations to provide 24/7 support for critical systems and respond to incidents promptly.
Continuously improve system reliability through capacity planning, performance tuning, and root cause analysis.
Document system architecture, operational procedures, and best practices related to Elastic and overall infrastructure.
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience.
3+ years of experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role.
Strong hands-on experience with Elastic Stack components: Elasticsearch, Logstash, Kibana, and Beats.
Proficiency in Linux system administration and networking fundamentals.
Experience with automation and configuration management tools such as Ansible, Terraform, or similar.
Solid scripting skills in languages like Python, Bash, or Go.
Familiarity with containerization (Docker) and orchestration platforms (Kubernetes) is a plus.
Experience with monitoring and alerting tools such as Prometheus, Grafana, or equivalent.
Strong problem-solving skills and ability to work under pressure during incidents.
Job ID: 143487263