Responsibilities:
- System Monitoring & Performance: Proactively monitor production and non-production systems, identify anomalies, and optimize performance. Collaborate with engineering and security teams to implement improvements.
- Incident Management & Problem Resolution: Respond to L1, L2, and L3 incidents, perform root-cause analysis, and implement corrective actions. Coordinate with internal teams and vendors to meet SLAs and escalate critical issues as needed.
- Operational Documentation & Governance: Maintain runbooks, configuration guides, and operational procedures. Ensure compliance with security, audit, and regulatory standards.
- Daily Operations & Reporting: Oversee ticket handling, system checks, and release support. Generate performance reports, incident summaries, and operational dashboards for stakeholders.
- Automation & Tooling: Use scripting and automation tools (Terraform, Ansible, Python, Shell) to reduce manual tasks and improve efficiency. Configure monitoring solutions across applications and infrastructure using APM tools and cloud-native monitoring.
- Security & Compliance: Implement access controls and enforce security best practices to protect data and maintain compliance.
- Cloud & DevOps Support: Support cloud environments on AWS, Azure, or Google Cloud. Participate in CI/CD pipelines, DevOps workflows, and environment maintenance to ensure smooth deployments.
- Collaboration & Continuous Improvement: Partner with cross-functional teams to enhance system reliability. Drive process improvements and propose innovative solutions for operational excellence.
Requirements:
- Strong understanding of operational support, incident management, and site reliability practices.
- Hands-on experience with ITSM tools (e.g., Remedy, Zendesk, ServiceDesk) and monitoring solutions (APM, cloud-native monitoring, log analytics).
- Experience with automation using Terraform, Ansible, Python, or Shell scripting.
- Proficiency with cloud platforms (AWS, Azure, Google Cloud) and modern DevOps/CI/CD workflows.
- Knowledge of security monitoring, access control, and compliance standards.
- Excellent analytical and troubleshooting skills with a proactive, solution-oriented mindset.
- Strong communication skills and ability to collaborate across teams.
- Highly organized, capable of managing multiple priorities in a fast-paced environment.
- Bachelor's Degree or Diploma in Computer Science, Information Technology, Computer Engineering, or related disciplines.
- Previous experience as an Operations Support Engineer, Site Reliability Engineer, or similar operational role.
- Cloud certifications (AWS, Azure, Google Cloud) are highly preferred.
Interested candidates are encouraged to submit their resumes, outlining their relevant experience and achievements, to apply88(@)talentvis.com or click Apply.
We regret to inform you that only shortlisted candidates will be notified.
EA License No: 04C3537
EA Personnel No: R22106683
EA Personnel Name: Yang Hui Shan, Sherri