Monitoring & Observability:
- Define and lead the enterprise observability strategy, ensuring complete visibility into application and infrastructure health.
- Oversee the architecture, engineering, and integration of events, logging, metrics, tracing, dashboards, alerting, and SLOs.
- Champion the adoption of modern observability tools (e.g., Grafana, Prometheus, OpenTelemetry, Splunk, Datadog).
- Partner with application and infrastructure teams to implement observability best practices and optimize incident detection and root cause analysis.
SRE / DevOps Leadership:
- Lead global SRE and DevOps teams to implement scalable, resilient, and secure systems.
- Drive automation of operational tasks, reduce toil, and embed reliability into every stage of the software delivery lifecycle.
- Establish and maintain performance SLAs, SLIs, and SLOs aligned with business goals.
- Implement CI/CD pipelines and infrastructure-as-code practices to accelerate release cycles and improve developer productivity.
Tools, Automation, and Software Engineering:
- Evaluate, select, and implement industry-leading tools for monitoring, incident management, and performance optimization to enhance system reliability and operational efficiency.
- Design and drive a comprehensive automation strategy to streamline deployment, configuration management, and incident response processes, reducing manual intervention and improving consistency.
- Champion the adoption of infrastructure-as-code practices to automate infrastructure provisioning and management, enabling scalable and repeatable deployments across global environments.
- Lead initiatives to enhance monitoring and alerting frameworks, ensuring proactive detection of issues and minimizing downtime through automated remediation processes.
- Build and implement software-based solutions to address infrastructure challenges, leveraging data analytics and an API-driven approach to enhance automation and improve operational efficiency.
Strategic Planning & Execution:
- Define and execute multi-year roadmaps for observability, DevOps, and reliability engineering initiatives.
- Collaborate with technology and business leaders to align platform engineering priorities with business strategy.
- Regularly present operational insights, reliability metrics, and transformation progress to executive stakeholders.
Transformation & Efficiency:
- Lead the modernization of legacy systems and platforms, including cloud migration and tooling consolidation, in partnership with functional leads.
- Introduce data-driven decision-making and performance benchmarking through advanced metrics and scorecards.
- Drive cost optimization and performance tuning initiatives across platforms and environments.
Team Leadership & Development:
- Build and scale high-performing, globally distributed engineering teams.
- Promote a culture of innovation, collaboration, continuous improvement, and engineering excellence.
- Mentor and develop technical leaders, fostering growth and career progression across the organization.
What qualifications or skills should you possess in this role?
- 15+ years of experience in technology, including 5+ years in a leadership role overseeing SRE, DevOps, or observability functions.
- Proven expertise in monitoring and observability tools and frameworks, along with hands-on knowledge of cloud platforms (AWS, GCP, Azure).
- Experience implementing and scaling SRE practices, including automation, incident response, and performance optimization.
- Strong understanding of modern software delivery practices, CI/CD pipelines, and agile methodologies.
- Exceptional leadership, communication, and stakeholder management skills.
- Certifications in cloud or SRE practices are a plus.