Service Delivery Lead - L3 AWS Cloud

3-6 Years
SGD 7,000 - 9,000 per month
  • Posted 17 hours ago

Job Description

Responsibilities:

Multi-Cloud Infrastructure Operations

  • Operate, maintain, and continuously improve cloud-native production environments across Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
  • Provide hands-on technical leadership across a broad range of cloud services, including but not limited to:
      • AWS: Lambda, ECS/EKS, FSx, Glue, SES, GuardDuty, WAF, Shield Advanced, Security Hub, KMS, Secrets Manager, SNS, SQS, EventBridge, API Gateway, EC2, S3, CloudWatch, Systems Manager
      • Azure: Virtual Machines, Azure Kubernetes Service (AKS), Azure Functions, Azure Storage, Azure Monitor
      • GCP: Compute Engine, Google Kubernetes Engine (GKE), Cloud Functions, Cloud Storage, Cloud Monitoring
  • Monitor, analyze, and troubleshoot infrastructure performance, availability, scalability, and cost efficiency across all cloud platforms.
  • Support both production and staging environments, ensuring adherence to 24/7 high-availability and reliability objectives, including strict SLA and SLO commitments.
  • Participate in a 24/7 shift rotation to provide round-the-clock operational coverage.
  • Provide hands-on technical support and guidance to L2 engineers, leading incident response, root-cause analysis, and resolution of complex infrastructure and application issues.

Operating System Lifecycle & Patch Management

  • Lead and oversee operating system patching and lifecycle management across RHEL (v8-v10) and Windows Server (2016-2025) environments using tools such as AWS Systems Manager Patch Manager, Azure Update Management, WSUS, SCCM, and YUM/DNF.
  • Maintain strong foundational knowledge of Linux system administration, complemented by deep expertise in Windows (Wintel) operating system patching, hardening, and lifecycle management.
  • Plan, schedule, automate, and track patch deployments across development, staging, and production environments, ensuring consistency and repeatability.
  • Coordinate patch approvals with security, compliance, and business stakeholders to ensure alignment with organizational policies, risk frameworks, and audit requirements.
  • Execute monthly and quarterly patching cycles with minimal service disruption, adhering to defined change management and maintenance windows.
  • Perform post-patch validation, health checks, and remediation activities to confirm system stability, security posture, and operational readiness.

Application Deployment & Troubleshooting

  • Deploy, operate, and troubleshoot applications across Windows and Linux operating systems in cloud-based and hybrid environments.
  • Provide OS-level diagnostics, performance tuning, and stability support to application teams, including CPU, memory, disk, network, and process-level analysis.
  • Partner closely with development and DevOps teams to identify, isolate, and resolve infrastructure-, platform-, and OS-related application issues throughout the application lifecycle.
  • Implement, maintain, and continuously enhance application monitoring, logging, and alerting frameworks to ensure early issue detection and rapid incident response in production environments.

Security & Compliance

  • Execute and manage CIS (Center for Internet Security) control implementations and remediations across multi-cloud environments to strengthen security posture.
  • Perform security hardening in accordance with CIS Benchmarks, industry best practices, and government-mandated security baselines.
  • Conduct continuous vulnerability identification, assessment, and remediation using tools such as Trend Micro Vision One, Qualys, Tenable, and AWS Config, ensuring timely risk mitigation.
  • Track, manage, and renew SSL/TLS certificates across all environments to prevent service disruptions and maintain secure communications.
  • Proactively identify and remediate End-of-Life (EOL) and End-of-Support (EOS) components, including operating systems, middleware, and AWS Lambda runtimes, to reduce security and compliance risks.
  • Support and maintain compliance with government-grade security, audit, and regulatory requirements, including evidence collection, audit readiness, and remediation tracking.

Container & DevSecOps

  • Demonstrate strong working knowledge of container and orchestration technologies, including Docker, Kubernetes, and managed container platforms such as Amazon ECS/EKS, Azure Kubernetes Service (AKS), and Google Kubernetes Engine (GKE).
  • Apply familiarity with DevSecOps principles and practices, including exposure to SHIP-HATS (Secure Hybrid Integration Pipeline - Hive Agile Testing Solutions) within the Singapore Government technology ecosystem.
  • Support and maintain CI/CD pipeline operations, ensuring seamless integration with security scanning, vulnerability assessment, and compliance validation tools across the software delivery lifecycle.

ITIL & Service Management

  • Adhere to ITIL-based service management processes, including Incident, Problem, Change, and Request Management, ensuring consistent and controlled service delivery.
  • Manage, prioritize, and resolve ITSM tickets using platforms such as ServiceNow, Jira, or equivalent tools, meeting defined service commitments and response targets.
  • Drive timely and effective ticket escalation and coordination between engineering teams, service owners, and stakeholders to ensure prompt issue resolution.
  • Coordinate and govern change management activities, including preparing change documentation and participating in Change Advisory Board (CAB) reviews, and provide guidance and oversight to junior engineers.
  • Monitor, maintain, and report against Service Level Agreements (SLAs) and Operational Level Agreements (OLAs) to ensure service performance, accountability, and continuous improvement.

Documentation & Knowledge Management

  • Create, maintain, and continuously update comprehensive infrastructure runbooks, system documentation, architecture design artefacts, and change-tracking logs for assigned applications and platforms.
  • Develop and standardize Standard Operating Procedures (SOPs), operational guidelines, and knowledge base articles to support consistent service delivery and efficient incident resolution.
  • Ensure audit readiness through disciplined documentation practices, including version control, traceability, and alignment with security and compliance requirements.
  • Maintain accurate Configuration Management Databases (CMDBs) and asset inventories, ensuring alignment with deployed infrastructure and operational states.

Leadership & Mentorship

  • Provide technical leadership, guidance, and mentorship to Level 2 and junior engineers, fostering skill development, accountability, and operational excellence.
  • Lead and facilitate technical discussions, design reviews, and architecture governance forums, ensuring solutions align with organizational standards, security requirements, and best practices.
  • Plan and deliver knowledge transfer sessions, technical training, and operational walkthroughs to uplift team capability and reduce single points of failure.
  • Act as the primary escalation point for complex or high-impact technical issues, driving root-cause analysis and sustainable long-term remediation.
  • Champion continuous improvement initiatives, automation adoption, and operational best practices to enhance service reliability, efficiency, and team maturity.

Soft Skills & Competencies

  • Problem Solving - Demonstrates advanced troubleshooting and analytical skills to diagnose and resolve complex issues across multi-cloud and hybrid environments.
  • Communication - Communicates clearly and effectively with technical and non-technical audiences, including engineers, stakeholders, and senior management.
  • Leadership - Provides direction and influence to guide teams, drive technical initiatives, and deliver high-quality outcomes.
  • Collaboration - Works effectively across engineering, security, operations, and business teams to achieve shared objectives.
  • Adaptability - Remains responsive and effective in fast-changing, dynamic, and high-pressure environments.
  • Accountability & Attention to Detail - Takes ownership of service delivery and outcomes, ensuring accuracy, reliability, security, and compliance in all implementations.
  • Customer Focus - Maintains a service-oriented mindset with strong stakeholder management and a commitment to meeting business and customer needs.
  • Continuous Learning - Proactively stays current with evolving cloud technologies, security standards, and industry best practices.
  • Resilience - Performs effectively under pressure, particularly during incidents, outages, and critical operational situations.
  • Mentorship - Actively develops, coaches, and supports junior engineers to build team capability and long-term sustainability.

This Subject Matter Expert (SME) role requires the individual to consistently demonstrate the following behaviors and capabilities:

  • Deep proficiency in Amazon Web Services (AWS), with solid working knowledge of Microsoft Azure and Google Cloud Platform (GCP) to support and guide multi-cloud operations.
  • Proven ability to operate within uptime-critical, security-sensitive, and compliance-driven environments, maintaining service reliability and operational excellence.
  • Strong technical leadership and mentorship capabilities, providing guidance, oversight, and skills development for junior and mid-level engineers.
  • A proactive mindset focused on incident prevention, continuous improvement, and adoption of best practices to enhance system stability and resilience.
  • A calm, structured, and methodical approach to incident management, with strict adherence to change management, incident response, and escalation procedures.
  • An audit-readiness mindset, supported by rigorous documentation, traceability, and evidence-based operational practices.
  • Ability to drive technical escalations, coordinate cross-functional resolution efforts, and manage clear, timely stakeholder communications during incidents and service-impacting events.
  • Demonstrated experience working within Singapore Government technology frameworks, policies, and regulatory standards, including alignment with public-sector governance and security requirements.

Job ID: 147645697