We are seeking an accomplished Cloud Technical Manager with deep technical expertise across multi-cloud platforms to lead cloud infrastructure operations for both commercial and Singapore Government-appointed agencies. This senior technical leadership role demands extensive hands-on experience managing complex cloud environments across Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), with a strong focus on both Linux and Windows system administration. The ideal candidate will possess L3-level technical proficiency, demonstrate exceptional stakeholder management capabilities, and provide technical guidance to engineering teams while driving operational excellence, security hardening, and infrastructure modernization initiatives.
Key Responsibilities
Multi-Cloud Infrastructure Leadership & Architecture:
- Lead the design, deployment, and management of cloud-native architectures across AWS, Microsoft Azure, and Google Cloud Platform in production environments
- Architect and implement scalable, highly available, and secure multi-cloud solutions aligned with business requirements and government compliance standards
- Provide technical leadership for cloud services including AWS (EC2, S3, Lambda, ECS/EKS, RDS, CloudWatch, Systems Manager), Azure (Virtual Machines, Azure Kubernetes Service (AKS), Azure Monitor), and GCP (Compute Engine, Google Kubernetes Engine (GKE), Cloud Functions, Cloud Storage, Cloud Monitoring)
- Design and implement infrastructure architecture for new application deployments, ensuring best practices in scalability, performance, and cost optimization
- Evaluate and recommend cloud technologies, services, and architectural patterns to support business objectives and digital transformation initiatives
- Lead migration initiatives from on-premises to cloud and cloud-to-cloud migrations across AWS, Azure, and GCP
- Monitor and optimize cloud resource utilization, implementing cost management strategies and right-sizing recommendations
Technical Team Leadership & Mentorship:
- Provide technical leadership, guidance, and mentorship to L2 Linux Engineers, L2 Wintel Engineers, and L3 Cloud Engineers
- Conduct technical design reviews, code reviews for Infrastructure as Code (IaC), and architectural assessments
- Act as the technical escalation point for complex infrastructure issues requiring advanced troubleshooting and resolution
- Drive knowledge transfer initiatives, facilitate technical training sessions, and develop engineering team capabilities
- Lead incident response for critical production issues, coordinating cross-functional teams and ensuring rapid resolution
- Foster a culture of operational excellence, automation, continuous improvement, and technical innovation
- Participate in a 24/7 shift rotation and on-call escalation support to provide leadership during critical incidents
Operating System Lifecycle & Patch Management:
- Oversee and coordinate enterprise-wide OS patching operations across RHEL (v7 to v10) and Windows Server (2016 to 2025) environments using tools such as AWS Systems Manager, Azure Update Management, WSUS, SCCM, and YUM/DNF
- Demonstrate advanced proficiency in both Linux and Windows system administration with the ability to troubleshoot complex issues across both platforms
- Develop and enforce patching strategies, policies, and schedules aligned with security compliance requirements and business continuity objectives
- Lead monthly and quarterly patch cycles, ensuring comprehensive testing, validation, and rollback procedures
- Coordinate patch approvals with Change Advisory Board (CAB) and manage stakeholder communications throughout patching activities
- Execute post-patch validation, remediation activities, and compliance reporting for audit requirements
- Identify and manage End-of-Life (EOL) operating systems and applications, planning upgrade and migration strategies
Security Hardening & Compliance Management:
- Lead CIS (Center for Internet Security) security hardening initiatives and remediation activities across all cloud platforms and operating systems
- Implement and maintain security baselines based on CIS Benchmarks, government security standards (IM8 Policy), and industry best practices
- Oversee vulnerability management programs using tools such as Trend Micro, Qualys, Tenable, and AWS Config
- Prioritize, coordinate, and track security remediation efforts across infrastructure teams to ensure timely resolution of vulnerabilities
- Manage SSL/TLS certificate lifecycle, including renewals, implementation, and monitoring across multi-cloud environments
- Ensure compliance with government-level security, audit, and regulatory requirements including SOC 2, ISO 27001, and Singapore government frameworks
- Collaborate with InfoSec teams on security assessments, penetration testing, and audit preparations
- Implement and maintain security monitoring, logging, and alerting mechanisms using native cloud tools and third-party solutions
Infrastructure as Code (IaC) & Automation:
- Lead Infrastructure as Code initiatives using Terraform, Ansible, AWS CloudFormation, and Azure Resource Manager (ARM) templates
- Design and implement automated infrastructure deployment pipelines with CI/CD integration
- Troubleshoot complex environment drift, pipeline failures, and infrastructure provisioning issues across multi-cloud environments
- Implement and maintain GitOps practices for infrastructure deployment and version control
- Drive automation initiatives to reduce manual operational overhead and improve infrastructure reliability
ITIL Process Management & Service Delivery:
- Oversee ITIL processes including Incident Management, Problem Management, Change Management, and Request Management
- Manage and optimize ITSM workflows using ServiceNow, Jira, or similar enterprise ITSM platforms
- Lead Change Advisory Board (CAB) reviews for infrastructure changes, providing technical assessment and risk analysis
- Drive incident escalation processes, root cause analysis (RCA), and Post-Incident Review (PIR) activities
- Ensure compliance with Service Level Agreements (SLAs) and Operational Level Agreements (OLAs)
- Implement continuous service improvement initiatives based on operational metrics, KPIs, and stakeholder feedback
- Maintain comprehensive documentation including runbooks, standard operating procedures (SOPs), and architectural diagrams
Stakeholder Management & Communication:
- Act as the primary technical liaison between infrastructure teams and business stakeholders, application owners, and senior management
- Manage expectations and communicate technical concepts effectively to both technical and non-technical audiences
- Coordinate with cross-functional teams including Development, Security, Networking, and Database teams on infrastructure initiatives
- Lead technical discussions, architecture reviews, and solution design sessions with stakeholders
- Provide regular status updates, operational reports, and capacity planning recommendations to management
- Manage vendor relationships for cloud services, security tools, and infrastructure platforms
- Facilitate communication during critical incidents, ensuring timely updates to all stakeholders and maintaining service transparency
Container Orchestration & DevSecOps:
- Provide technical leadership for containerization initiatives using Docker, Kubernetes, Amazon ECS, Amazon EKS, Azure Kubernetes Service (AKS), and Google Kubernetes Engine (GKE)
- Implement and maintain DevSecOps practices with SHIP-HATS (Secure Hybrid Integration Pipeline - Hive Agile Testing Solutions) within the Singapore Government technology stack
- Oversee CI/CD pipeline operations, integrating security scanning tools including SAST, DAST, and container vulnerability scanning
- Drive containerization strategy and microservices architecture adoption across application portfolios
Monitoring, Observability & Performance Optimization:
- Design and implement comprehensive monitoring, logging, and alerting strategies using CloudWatch, Azure Monitor, GCP Cloud Monitoring, and third-party observability platforms
- Configure and maintain observability stacks for metrics, logs, traces, and alerts across multi-cloud environments
- Implement log aggregation and analysis using centralized logging solutions
- Lead performance optimization initiatives, conducting capacity planning and resource right-sizing activities
- Establish operational dashboards, reporting mechanisms, and proactive alerting for infrastructure health and performance
Documentation & Knowledge Management:
- Create and maintain comprehensive infrastructure documentation, including system architecture diagrams, network topology, and data flow diagrams
- Develop and maintain technical runbooks, troubleshooting guides, and disaster recovery procedures
- Ensure audit-readiness through meticulous documentation discipline and change tracking
- Maintain Configuration Management Database (CMDB) accuracy and asset inventories
- Build and maintain knowledge base articles, FAQs, and best practice documentation for team reference