Lead Reliability: Own the site reliability process from design to deployment, ensuring our systems are robust, scalable, and secure.
Teach: Guide the software engineering team on best practices and work together to evolve our engineering processes.
Mastermind Infrastructure: Manage and optimize multiple Kubernetes clusters across environments and build core services that drive our platform.
Incident Response: Resolve production incidents quickly, minimizing downtime and restoring performance.
Team Leadership: Help build and lead a high-velocity, adaptable infrastructure engineering team.
Requirements
SRE Expertise: Deep experience in site reliability engineering within multi-datacenter cloud environments with high demands on uptime, performance, and security.
Technical Acumen: Strong background in AWS, Kubernetes, Docker, Terraform, and software engineering, with the ability to adopt and integrate open-source and commercial technologies.
Leadership: Proven experience in leading teams and projects, with the ability to work closely with senior management to prioritize and allocate resources effectively.
Advanced Infrastructure: Experience with cloud infrastructure, low-latency systems, HashiCorp tools, and multi-cloud environments.
Cloud Security: Expertise in securing complex cloud network architectures.