Summary
We're looking for a Site Reliability Engineering (SRE) Manager with strong architectural experience to join the JMET SRE Team. You'll play a key role in leading SRE teams and in designing and scaling reliable, secure, and high-performance infrastructure across our cloud and hybrid environments. You'll be responsible for establishing reliability patterns, driving large-scale systems design, and building automation frameworks to support production systems at scale.
Description
This is a hands-on leadership role with architectural ownership, strategic influence, and deep technical impact across multiple domains — including application and infrastructure security, incident response engineering, and resilience automation.
Responsibilities
- Architect Scalable Infrastructure: Design, evolve, and review highly reliable, performant, and cost-efficient cloud-native and hybrid infrastructure using Infrastructure as Code (IaC), containers, and microservices principles.
- Support Cryptographic Systems at Scale: Design and operationalize scalable, secure integrations with Hardware Security Modules (HSMs) for sensitive workloads, key management, and cryptographic operations.
- Drive SRE Best Practices: Define and implement service-level indicators (SLIs), objectives (SLOs), and agreements (SLAs) to guide engineering teams toward reliability and observability goals.
- Incident Architecture and Prevention: Serve as a technical lead during major incidents. Partner with security and platform teams to conduct thorough post-incident reviews, drive systemic improvements, and establish preventive architectural controls.
- System Design and Tooling: Build and maintain reusable tooling, automation frameworks, and reliability platforms — including observability, alerting, chaos testing, auto-scaling, and failover.
- Reliability as Code: Champion resilience engineering through automation pipelines, CI/CD integrations, canary releases, and chaos engineering principles.
- Multi-Cloud and Hybrid Systems: Design, assess, and guide architecture decisions across AWS, GCP, AliCloud, and on-premises infrastructure. Ensure consistency, interoperability, and regulatory compliance.
- Security and Compliance: Ensure architectural patterns align with security standards, compliance requirements, and audit readiness.
Minimum Qualifications
- 10 or more years of experience in SRE, DevOps, or Infrastructure Engineering roles, with 2 or more years in a managerial capacity.
- Deep expertise in cloud infrastructure (AWS, GCP, or AliCloud) and container orchestration (Kubernetes, EKS).
- Proven experience with Infrastructure as Code tools such as Terraform or CloudFormation.
- Strong understanding of distributed systems, networking, and systems design at scale.
- Proficiency in at least one programming or scripting language, such as Python, Go, or Bash.
Preferred Qualifications
- Solid background in CI/CD tools and modern deployment strategies, for example Spinnaker and GitOps.
- Familiarity with security best practices in cloud and containerized environments.
- Experience with HSMs and cryptographic operations at scale.
Apple is an equal opportunity employer that is committed to inclusion and diversity. Apple provides reasonable accommodations to applicants with disabilities and in accordance with local requirements. Apple is a drug-free workplace.