zig by comfortdelgro

DevOps Lead (Platform & Infrastructure Engineering)

7-9 Years
  • Posted 13 hours ago

Job Description

Job Responsibilities:

  • Maintain, refactor, and mature our existing AWS infrastructure and Terraform codebases, ensuring clean remote state management and modular scalability.
  • Drive proactive infrastructure upgrade cycles (e.g., AWS service deprecations, EKS platform version updates, and Terraform provider upgrades) to minimize technical debt and security risks.
  • Deeply analyze and optimize existing cloud spend, focusing on compute efficiency, storage lifecycles, and cloud-to-cloud/cross-AZ data transfer fees, without compromising resilience.
  • Architect the transition of current systems toward next-generation platform engineering patterns (e.g., policy-as-code, self-service infrastructure portals) to support long-term organizational growth.
  • Maintain and optimize running production Amazon EKS clusters, managing zero-downtime upgrades, node group efficiencies, and ingress/egress configurations.
  • Own and optimize our ArgoCD implementation, ensuring strict configuration drift detection, clean synchronization policies, and multi-cluster environment alignment.
  • Continuously tune running containerized workloads by refining pod resource requests/limits, autoscaling policies, and cluster security boundaries.
  • Manage, secure, and scale our self-hosted Bitbucket Runners fleet, ensuring high availability, rapid build execution times, and compute cost-efficiency.
  • Maintain and continually optimize existing Bitbucket Pipelines to accelerate build-and-test feedback loops for developers while integrating automated compliance and security gates.
  • Maintain and enhance full-stack observability frameworks (metrics, logs, traces) to ensure deep visibility into distributed microservices.
  • Monitor system SLAs/SLIs/SLOs, eliminate alerting noise, and act as a senior technical escalation point for critical production incidents, driving comprehensive post-mortems and preventative remediation.
  • Guide, upskill, and mentor DevOps engineers, setting high engineering standards for operational tasks, infrastructure-as-code quality, and documentation.
  • Collaborate closely with software development leads to understand operational pain points, translating their feedback into long-term platform roadmap features.
  • Perform any ad hoc duties as assigned.

Job Requirements:

  • Preferably 7+ years of dedicated experience in DevOps, Cloud Operations, or Site Reliability Engineering, with at least 2 years leading engineering teams or overseeing complex production infrastructure environments.
  • Deep production experience maintaining core AWS environments (VPC, EC2, IAM, S3, RDS, Route53, CloudFront) at scale.
  • Strong proficiency in maintaining, refactoring, and upgrading large-scale Terraform codebases (multi-workspace, remote state management).
  • Hands-on experience operating production Kubernetes (EKS) and driving continuous delivery using ArgoCD in a live, multi-environment ecosystem is preferred.
  • Proven track record managing and troubleshooting self-hosted CI/CD runner fleets (specifically Bitbucket Runners) and optimizing delivery pipelines.
  • Competency in writing clean, maintainable scripts/tools using Python, Go, or Bash to automate operational runbooks and platform tasks.
  • Experience utilizing modern Kubernetes autoscalers like Karpenter for dynamic, cost-optimized EKS node provisioning.
  • Experience integrating automated compliance, dependency scanning, and vulnerability management (e.g., Trivy, Checkov, OPA) into existing runtime environments.
  • Certified Kubernetes Administrator (CKA) or AWS Certified DevOps Engineer – Professional certification is a plus.

Job ID: 149415101