Responsibilities:
Observability and Monitoring:
- Design, implement, and manage comprehensive observability solutions, including monitoring, logging, and tracing, using industry-standard tools (e.g., Prometheus, Grafana, ELK Stack, Datadog, New Relic, Splunk)
- Develop and maintain dashboards, alerts, and reports to provide real-time insights into system health, application performance, and key business metrics
- Instrument applications and infrastructure to collect relevant data for monitoring and analysis
Application Support and Troubleshooting:
- Provide expert-level support for production applications, including troubleshooting and resolving complex incidents, performance bottlenecks, and system outages
- Collaborate with development teams to ensure applications are designed for observability and supportability
- Participate in on-call rotations to ensure timely resolution of production issues
Trend Analysis and Performance Optimization:
- Conduct proactive trend analysis of system and application performance data to identify patterns, anomalies, and potential risks
- Perform thematic studies to analyze performance trends, identify root causes of issues, and recommend optimization strategies
- Develop and implement solutions to improve system performance, scalability, and efficiency
Automation (very important):
- Develop and maintain automation scripts and tools to streamline observability workflows, incident response, and system maintenance tasks
- Automate the deployment and configuration of observability infrastructure
- Identify opportunities for automation to reduce manual effort and improve operational efficiency
Banking/Financial Sector Focus:
- Ensure compliance with relevant regulatory requirements (e.g., IM8) and industry best practices specific to the banking/financial sector
- Participate in audits and provide necessary data and documentation related to system performance and availability
- Understand the unique challenges and requirements of maintaining highly available and secure financial systems
Collaboration and Communication:
- Work closely with cross-functional teams, including development, operations, and security, to ensure seamless integration of observability practices
- Communicate technical issues and solutions effectively to both technical and non-technical stakeholders
- Document procedures, best practices, and knowledge base articles
Qualifications:
- Bachelor's degree in Computer Science, Information Technology or a related field
- 8+ years of experience in an Observability, SRE, or similar role
- Strong experience working within the banking or financial sector
- Deep understanding of observability principles and best practices, with strong knowledge of AppDynamics, Dynatrace, other APM tools, and OpenTelemetry
- Proficiency with observability tools such as Prometheus, Grafana, ELK Stack, Datadog, and New Relic
- Experience with containerization and orchestration technologies (e.g., Docker, Kubernetes)
- Experience with scripting and automation languages (e.g., Python, Bash)
- Experience with cloud platforms (e.g., AWS, Azure, GCP); familiarity with Government Commercial Cloud (GCC) is a strong advantage
- Knowledge of Site Reliability Engineering (SRE) principles and practices
- Experience with Infrastructure as Code (IaC) tools (e.g., Terraform, Ansible)
- Relevant certifications (e.g., AWS Certified DevOps Engineer)
- Strong analytical, problem-solving, and troubleshooting skills
- Excellent communication and collaboration abilities
- Ability to thrive in a fast-paced, demanding environment