We are seeking a Site Reliability Engineer to work closely with IT teams in providing both operational and project support. The ideal candidate will play a key role in ensuring system reliability, improving observability, automating operational tasks, and supporting critical production environments.
This role requires strong technical expertise, effective communication skills, and the ability to collaborate well with both internal teams and external partners.
Key Responsibilities:
1. Observability and Proactive Monitoring
- Monitor log files, system health, and application performance to ensure service reliability and availability.
- Install, configure, and manage monitoring tools.
- Implement, enhance, and integrate monitoring solutions to enable proactive monitoring and improve business and operational processes.
- Analyze monitoring data and generate dashboards and reports to provide operational insights and support decision-making.
2. Automation of Day-to-Day Operational Activities (20%)
- Automate routine operational tasks using tools and scripting languages such as:
  - Ansible
  - Jenkins
  - Shell scripting
  - PowerShell
  - Python
3. Reliability Engineering Support (20%)
- Support service requests related to the reliability engineering function.
- Provide operational support for reliability-related activities.
- Participate in after-office-hours support when required, including:
  - Immediate response and resolution of production incidents
  - Support for project cutovers and implementation activities
Requirements:
- Minimum 3 years of experience in IT operations, automation, and monitoring solutions.
- At least 2 years of scripting experience, preferably with:
  - Ansible
  - Shell scripting
  - Python
- Familiarity with platforms and technologies such as:
  - Windows
  - Linux
  - Unix
  - Cloud platforms (preferably AWS)
  - Databases
  - Middleware
- Certification in Site Reliability Engineering (SRE) or an equivalent certification.
- Strong communication and interpersonal skills, with the ability to work effectively with both internal and external stakeholders.
Preferred Skills:
- Experience in proactive incident monitoring and operational support.
- Strong analytical skills with the ability to interpret system data and produce meaningful insights.
- Hands-on experience in automation and process improvement.
- Ability to work in a fast-paced environment and provide support for critical systems.
To apply, simply click the Apply button or send your updated profile to [Confidential Information]
EA Licence No.: 18S9405 / EA Reg. No.: R1330864
Percept Solutions is expanding and actively seeking talented individuals. We encourage applicants to follow Percept Solutions on LinkedIn at https://www.linkedin.com/company/percept-solutions/ to stay informed about new opportunities and events.