The Role:
The Senior Cloud Operations Engineer is responsible for the architectural stability, scalability, and security of the company's mission-critical cloud infrastructure. This role focuses on high-availability (HA) management for financial-grade systems, leveraging automation and cloud-native technologies to ensure seamless service delivery. You will bridge the gap between development and operations by building robust internal tools and monitoring ecosystems.
Job Responsibilities:
- Multi-Cloud Management: Lead the maintenance and architectural optimization of the company's public cloud platforms (Primary: AWS, Secondary: GCP).
- Financial-Grade Reliability: Ensure core business systems meet financial-level SLA requirements, including disaster recovery, high-availability design, and routine stress testing.
- Internal Platforms & Infrastructure as Code (IaC): Manage and optimize internal business systems, including Jira, Confluence, Docker registries, and Kubernetes (K8s) clusters.
- Internal Tooling Development: Design, develop, and maintain internal automation systems using Golang, including unified monitoring/alerting platforms, centralized logging centers, and CI/CD pipelines.
- Security & Compliance: Manage cloud resources, IAM accounts, and user permissions to meet strict financial security auditing and compliance standards.
- Incident Management: Lead troubleshooting for complex network and system issues, and participate in on-call rotations to support 24/7 system uptime.
Job Requirements:
- Education: Bachelor's degree in Computer Science, Software Engineering, or a related technical field.
- Experience: 4-5 years of solid experience in Cloud Operations, DevOps, or SRE roles.
- Cloud Proficiency: Expert knowledge of Linux administration and the AWS ecosystem (EC2, VPC, EKS, RDS, IAM).
- Programming: Proficient in Golang (essential for tool development) and scripting languages such as Python and Shell.
- Containerization: Extensive hands-on experience in managing and scaling production Kubernetes (K8s) environments.
- Observability: Familiarity with open-source monitoring/alerting tools (Prometheus and Grafana) and logging systems (Loki, Graylog, or ELK).
- Networking: Strong foundation in networking protocols (TCP/IP, BGP, VPN, Load Balancing) with the ability to diagnose complex connectivity issues.
- Industry Context: Prior experience in Financial Services (FinTech, Banking, or Payments) is highly preferred.
- Soft Skills: Proven ability to work under pressure and meet tight timelines; a proactive team player with strong professional ethics.