Responsibilities:
Multi-Cloud Infrastructure Operations
- Operate, maintain, and continuously improve cloud-native production environments across Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
- Provide hands-on technical leadership across a broad range of cloud services, including but not limited to:
  - AWS: Lambda, ECS/EKS, FSx, Glue, SES, GuardDuty, WAF, Shield Advanced, Security Hub, KMS, Secrets Manager, SNS, SQS, EventBridge, API Gateway, EC2, S3, CloudWatch, Systems Manager
  - Azure: Virtual Machines, Azure Kubernetes Service (AKS), Azure Functions, Azure Storage, Azure Monitor
  - GCP: Compute Engine, Google Kubernetes Engine (GKE), Cloud Functions, Cloud Storage, Cloud Monitoring
- Monitor, analyze, and troubleshoot infrastructure performance, availability, scalability, and cost efficiency across all cloud platforms.
- Support both production and staging environments, ensuring adherence to 24/7 high-availability and reliability objectives, including strict Service Level Agreement (SLA) and Service Level Objective (SLO) commitments.
- Participate in a 24/7 shift rotation to provide round-the-clock operational coverage.
- Provide hands-on technical support and guidance to Level 2 (L2) engineers, and lead incident response, root-cause analysis, and resolution of complex infrastructure and application issues.
Operating System Lifecycle & Patch Management
- Lead and oversee operating system patching and lifecycle management across RHEL (v8-v10) and Windows Server (2016-2025) environments using tools such as AWS Systems Manager Patch Manager, Azure Update Manager, WSUS, SCCM, and YUM/DNF.
- Maintain strong foundational knowledge of Linux system administration, complemented by deep expertise in Windows (Wintel) operating system patching, hardening, and lifecycle management.
- Plan, schedule, automate, and track patch deployments across development, staging, and production environments, ensuring consistency and repeatability.
- Coordinate patch approvals with security, compliance, and business stakeholders to ensure alignment with organizational policies, risk frameworks, and audit requirements.
- Execute monthly and quarterly patching cycles with minimal service disruption, adhering to defined change management and maintenance windows.
- Perform post-patch validation, health checks, and remediation activities to confirm system stability, security posture, and operational readiness.
Application Deployment & Troubleshooting
- Deploy, operate, and troubleshoot applications across Windows and Linux operating systems in cloud-based and hybrid environments.
- Provide OS-level diagnostics, performance tuning, and stability support to application teams, including CPU, memory, disk, network, and process-level analysis.
- Partner closely with development and DevOps teams to identify, isolate, and resolve infrastructure-, platform-, and OS-related application issues throughout the application lifecycle.
- Implement, maintain, and continuously enhance application monitoring, logging, and alerting frameworks to ensure early issue detection and rapid incident response in production environments.
Security & Compliance
- Execute and manage CIS (Center for Internet Security) control implementations and remediations across multi-cloud environments to strengthen security posture.
- Perform security hardening in accordance with CIS Benchmarks, industry best practices, and government-mandated security baselines.
- Conduct continuous vulnerability identification, assessment, and remediation using tools such as Trend Micro Vision One, Qualys, Tenable, and AWS Config, ensuring timely risk mitigation.
- Track, manage, and renew SSL/TLS certificates across all environments to prevent service disruptions and maintain secure communications.
- Proactively identify and remediate End-of-Life (EOL) and End-of-Support (EOS) components, including operating systems, middleware, and AWS Lambda runtimes, to reduce security and compliance risks.
- Support and maintain compliance with government-grade security, audit, and regulatory requirements, including evidence collection, audit readiness, and remediation tracking.
Container & DevSecOps
- Demonstrate strong working knowledge of container and orchestration technologies, including Docker, Kubernetes, and managed container platforms such as AWS ECS/EKS, Azure Kubernetes Service (AKS), and Google Kubernetes Engine (GKE).
- Apply working knowledge of DevSecOps principles and practices, including exposure to SHIP-HATS (Secure Hybrid Integration Pipeline - Hive Agile Testing Solutions) within the Singapore Government technology ecosystem.
- Support and maintain CI/CD pipeline operations, ensuring seamless integration with security scanning, vulnerability assessment, and compliance validation tools across the software delivery lifecycle.
ITIL & Service Management
- Adhere to ITIL-based service management processes, including Incident, Problem, Change, and Request Management, ensuring consistent and controlled service delivery.
- Manage, prioritize, and resolve ITSM tickets using platforms such as ServiceNow, Jira, or equivalent tools, meeting defined service commitments and response targets.
- Drive timely and effective ticket escalation and coordination between engineering teams, service owners, and stakeholders to ensure prompt issue resolution.
- Coordinate and govern change management activities, including preparing change documentation and participating in Change Advisory Board (CAB) reviews, providing guidance and oversight to junior engineers.
- Monitor, maintain, and report against Service Level Agreements (SLAs) and Operational Level Agreements (OLAs) to ensure service performance, accountability, and continuous improvement.
Documentation & Knowledge Management
- Create, maintain, and continuously update comprehensive infrastructure runbooks, system documentation, architecture design artefacts, and change-tracking logs for assigned applications and platforms.
- Develop and standardize Standard Operating Procedures (SOPs), operational guidelines, and knowledge base articles to support consistent service delivery and efficient incident resolution.
- Ensure audit readiness through disciplined documentation practices, including version control, traceability, and alignment with security and compliance requirements.
- Maintain accurate Configuration Management Databases (CMDBs) and asset inventories, ensuring alignment with deployed infrastructure and operational states.
Leadership & Mentorship
- Provide technical leadership, guidance, and mentorship to Level 2 and junior engineers, fostering skill development, accountability, and operational excellence.
- Lead and facilitate technical discussions, design reviews, and architecture governance forums, ensuring solutions align with organizational standards, security requirements, and best practices.
- Plan and deliver knowledge transfer sessions, technical training, and operational walkthroughs to uplift team capability and reduce single points of failure.
- Act as the primary escalation point for complex or high-impact technical issues, driving root-cause analysis and sustainable long-term remediation.
- Champion continuous improvement initiatives, automation adoption, and operational best practices to enhance service reliability, efficiency, and team maturity.
Soft Skills & Competencies
- Problem Solving - Demonstrates advanced troubleshooting and analytical skills to diagnose and resolve complex issues across multi-cloud and hybrid environments.
- Communication - Communicates clearly and effectively with technical and non-technical audiences, including engineers, stakeholders, and senior management.
- Leadership - Provides direction and influence to guide teams, drive technical initiatives, and deliver high-quality outcomes.
- Collaboration - Works effectively across engineering, security, operations, and business teams to achieve shared objectives.
- Adaptability - Remains responsive and effective in fast-changing, dynamic, and high-pressure environments.
- Accountability & Attention to Detail - Takes ownership of service delivery and outcomes, ensuring accuracy, reliability, security, and compliance in all implementations.
- Customer Focus - Maintains a service-oriented mindset with strong stakeholder management and a commitment to meeting business and customer needs.
- Continuous Learning - Proactively stays current with evolving cloud technologies, security standards, and industry best practices.
- Resilience - Performs effectively under pressure, particularly during incidents, outages, and critical operational situations.
- Mentorship - Actively develops, coaches, and supports junior engineers to build team capability and long-term sustainability.
This Subject Matter Expert (SME) role requires the individual to consistently demonstrate the following behaviors and capabilities:
- Deep proficiency in Amazon Web Services (AWS), with solid working knowledge of Microsoft Azure and Google Cloud Platform (GCP) to support and guide multi-cloud operations.
- Proven ability to operate within uptime-critical, security-sensitive, and compliance-driven environments, maintaining service reliability and operational excellence.
- Strong technical leadership and mentorship capabilities, providing guidance, oversight, and skills development for junior and mid-level engineers.
- A proactive mindset focused on incident prevention, continuous improvement, and adoption of best practices to enhance system stability and resilience.
- A calm, structured, and methodical approach to incident management, with strict adherence to change management, incident response, and escalation procedures.
- An audit-readiness mindset, supported by rigorous documentation, traceability, and evidence-based operational practices.
- Ability to drive technical escalations, coordinate cross-functional resolution efforts, and manage clear, timely stakeholder communications during incidents and service-impacting events.
- Demonstrated experience working within Singapore Government technology frameworks, policies, and regulatory standards, including alignment with public-sector governance and security requirements.