Operations Support Engineer

WEBSPARKS PTE. LTD.

Singapore

5-7 Years

SGD 7,000 - 12,000 per month

Posted 2 days ago

Job Description

1-year contract
Government project
Hybrid work arrangement

KEY RESPONSIBILITIES

Infrastructure Management: Design, build, and maintain critical cloud infrastructure platforms encompassing compute, storage, networking, containerisation, virtualisation, DNS, monitoring, and supporting systems across development, staging, and production environments. Monitor and manage comprehensive cloud services including CloudWatch logs, alarms, synthetic monitoring, and integrated third-party solutions.

Monitoring and Observability: Implement and maintain robust monitoring and observability frameworks for all platform components utilising modern tooling including AWS CloudWatch Canaries, StackOps, Prometheus, Grafana, and ELK stack implementations. Establish comprehensive observability practices to support proactive problem diagnosis and provide actionable insights into system health and performance metrics.

Compliance and Security: Maintain adherence to Whole-of-Government platform standards, compliance frameworks, and security requirements through continuous monitoring using government-approved security and monitoring solutions. Implement security controls including access management, security hardening, and compliance monitoring with tools such as CyberArk.

Automation and Infrastructure as Code: Develop and maintain infrastructure using Infrastructure as Code (IaC) methodologies with tools including Terraform, Ansible, and AWS CloudFormation to ensure repeatable, automated, and version-controlled deployments. Follow platform standards whilst executing infrastructure automation and modern operational practices to enhance efficiency and reliability.

Site Reliability Engineering: Identify and eliminate repetitive operational tasks to improve developer and infrastructure engineer efficiency whilst enhancing overall system reliability through systematic toil elimination and error budget management. Define, track, and report on SRE metrics including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.

Platform Operations: Manage virtualisation platforms including VMware vSphere and Hyper-V, encompassing capacity monitoring, performance optimisation, and lifecycle management. Administer AWS Cloud services including EC2, ECS, S3, RDS (PostgreSQL and MS SQL), Docker/Kubernetes, Lambda, CloudFormation, CloudWatch, IAM, and VPC configurations alongside physical server infrastructure.

Network and System Administration: Demonstrate proficiency with local networking technologies including TCP/IP, DNS, DHCP, VPN configurations, and routing protocols. Execute comprehensive platform patching strategies leveraging automation to maintain security and stability whilst minimising service disruption.

Business Continuity: Maintain backup, disaster recovery, and high availability solutions for critical platform components including AWS Fault Injection Simulator (FIS) testing and multi-availability zone configurations. Support containerisation initiatives and maintain container orchestration platforms for traditional workloads.

Collaboration and Documentation: Collaborate effectively with application teams to support platform stability, performance, and scalability requirements. Create and maintain comprehensive platform documentation, operational runbooks, and standard operating procedures. Support team development through knowledge sharing and mentoring on platform operations and modern infrastructure practices.

EXPERIENCE AND SKILLS REQUIRED

Technical Expertise:

Advanced experience with enterprise virtualisation platforms (VMware vSphere, Hyper-V)
Proficiency in Linux and Windows Server administration
Expertise in server monitoring tool installation and regular patching of virtual and physical servers
Comprehensive health check capabilities for servers, storage, and virtualisation platforms
Strong experience with infrastructure automation tools (Ansible, Puppet, Chef)
Proficiency with container technologies (ECS, Docker, Kubernetes)
Experience with monitoring and observability platforms
Infrastructure as Code expertise (Terraform, AWS CloudFormation, Ansible)
Solid understanding of networking concepts and technologies
Scripting capabilities in Python, PowerShell, Bash, and Node.js
Experience with high-availability and disaster recovery solutions including AWS FIS
Proficiency with GitHub tools and CI/CD pipeline setup and workflow management

Professional Qualifications: Bachelor's degree in Computer Science, Information Technology, or a related technical discipline with demonstrated experience in infrastructure operations and engineering. Strong understanding of enterprise infrastructure components with proven experience supporting infrastructure modernisation initiatives.

Core Competencies: Excellent analytical and problem-solving capabilities with strong documentation skills and effective communication abilities for both technical and non-technical stakeholders.

Desired Certifications:

VMware Certified Professional (VCP) or other VMware vSphere certification
Microsoft Certified: Windows Server
Red Hat Certified Engineer (RHCE)
AWS Certified Solutions Architect or AWS Certified SysOps Administrator
Additional certifications in networking, security, or government IT standards

Previous experience in government or highly regulated environments is strongly preferred.

SENIOR OPERATIONS SUPPORT ENGINEER - ADDITIONAL REQUIREMENTS

Leadership and Management: Lead infrastructure engineering teams to deliver comprehensive managed services for entire IT infrastructure environments. Direct desktop engineering teams to provide first-level support and technical problem resolution for end-user communities.

Strategic Operations: Oversee and direct daily IT infrastructure operations, ensuring reliable and secure system, service, and application performance. Monitor and manage incident response for business-critical systems with a focus on timely resolution to prevent operational delays and service outages.

Organisational Engagement: Demonstrate capability to engage effectively with organisational management whilst establishing guidelines, policies, and procedures with strong execution oversight. Manage multiple concurrent deadlines as a self-directed professional with appropriate prioritisation skills.

Operational Excellence: Monitor and respond to data centre issues and incidents whilst performing routine operational checks on servers, network devices, storage, and environmental systems. Track IT asset inventory ensuring comprehensive equipment accountability and end-of-life management.

Incident and Change Management: Respond promptly to system alerts, alarms, and incidents with appropriate escalation to support teams following defined procedures. Support incident troubleshooting and recovery activities whilst managing planned maintenance, change requests, and scheduled outages. Coordinate hardware installation, replacement, and decommissioning activities alongside media handling and secure storage management.