Maintain, refactor, and mature our existing AWS infrastructure and Terraform codebases, ensuring clean remote state management and modular scalability.
Drive proactive infrastructure upgrade cycles (e.g., AWS service deprecations, EKS platform version updates, and Terraform provider upgrades) to minimize technical debt and security risks.
Analyze and optimize existing cloud spend in depth, focusing on compute efficiency, storage lifecycles, and cross-cloud and cross-AZ data transfer fees, without compromising resilience.
Architect the transition of current systems toward next-generation platform engineering patterns (e.g., policy-as-code, self-service infrastructure portals) to support long-term organizational growth.
Maintain and optimize running production Amazon EKS clusters, managing zero-downtime upgrades, node group efficiencies, and ingress/egress configurations.
Own and optimize our ArgoCD implementation, ensuring strict configuration drift detection, clean synchronization policies, and multi-cluster environment alignment.
Continuously tune running containerized workloads by refining pod resource requests/limits, autoscaling policies, and cluster security boundaries.
Manage, secure, and scale our self-hosted Bitbucket Runners fleet, ensuring high availability, rapid build execution times, and compute cost-efficiency.
Maintain and continually optimize existing Bitbucket Pipelines to accelerate build-and-test feedback loops for developers while integrating automated compliance and security gates.
Maintain and enhance full-stack observability frameworks (metrics, logs, traces) to ensure deep visibility into distributed microservices.
Monitor system SLAs/SLIs/SLOs, eliminate alerting noise, and act as a senior technical escalation point for critical production incidents, driving comprehensive post-mortems and preventative remediation.
Guide, upskill, and mentor DevOps engineers, setting high engineering standards for operational tasks, infrastructure-as-code quality, and documentation.
Collaborate closely with software development leads to understand operational pain points, translating their feedback into long-term platform roadmap features.
Perform any ad hoc duties as assigned.
Job Requirements:
Preferably 7+ years of dedicated experience in DevOps, Cloud Operations, or Site Reliability Engineering, with at least 2 years leading engineering teams or overseeing complex production infrastructure environments.
Deep production experience maintaining core AWS environments (VPC, EC2, IAM, S3, RDS, Route53, CloudFront) at scale.
Strong proficiency in maintaining, refactoring, and upgrading large-scale Terraform codebases (multi-workspace, remote state management).
Hands-on experience operating production Kubernetes (EKS) and driving continuous delivery using ArgoCD in a live, multi-environment ecosystem is preferred.
Proven track record managing and troubleshooting self-hosted CI/CD runner fleets (specifically Bitbucket Runners) and optimizing delivery pipelines.
Competency in writing clean, maintainable scripts/tools using Python, Go, or Bash to automate operational runbooks and platform tasks.
Experience utilizing modern Kubernetes autoscalers like Karpenter for dynamic, cost-optimized EKS node provisioning.
Experience integrating automated compliance, dependency scanning, and vulnerability management (e.g., Trivy, Checkov, OPA) into existing runtime environments.
Certified Kubernetes Administrator (CKA) or AWS Certified DevOps Engineer – Professional certification is a plus.