
Responsibilities
Incident & Application Support
Deliver second-line (L2) support for production and staging environments, managing escalations from L1 Support.
Investigate application errors, system alerts, performance issues, and integration-related problems.
Restore services within the agreed SLA/OLA timelines and ensure incidents are properly closed.
Troubleshooting & Root Cause Analysis
Carry out detailed troubleshooting using logs, metrics, and monitoring tools.
Perform root cause analysis (RCA) for recurring incidents or those with high business impact.
Recommend and implement corrective and preventive measures to minimise incident recurrence.
Collaboration & Escalation
Collaborate closely with L3 engineers, DevOps teams, and external vendors to resolve complex technical issues.
Provide clear technical findings, relevant logs, and supporting evidence when escalating issues.
Take part in incident bridges, post-incident reviews, and operational discussions.
Operational Excellence
Monitor system health, alerts, dashboards, and logs to proactively detect potential issues.
Carry out approved configuration changes, patches, and operational fixes.
Support deployment, release, and maintenance activities as needed.
Automation & Continuous Improvement
Contribute to the automation of operational tasks, monitoring, and alerting where relevant.
Identify gaps in runbooks, SOPs, and operational processes, and drive enhancements.
Documentation
Maintain and update runbooks, troubleshooting guides, and knowledge base documentation.
Record incident resolutions and operational procedures in a clear and accurate manner.
Security & Compliance
Comply with security, access control, and regulatory requirements.
Handle sensitive information in logs, tickets, and systems with appropriate care.
Support audits, vulnerability remediation, and compliance reviews when necessary.
Key Experiences and Qualifications We Seek
Educational Background:
Diploma or above in Computer Science, Information Technology, or a related discipline.
Professional Experience:
3-5+ years of relevant experience in application support, systems support, or operations roles.
Experience supporting production systems in high-availability or mission-critical environments.
Technical Expertise:
Strong hands-on experience in:
o Application log analysis and monitoring tools (e.g. AWS CloudWatch, Grafana, ELK, Google Analytics)
o Linux/Unix environments
Working knowledge of cloud platforms such as AWS, including services like ECS, Lambda, S3, and RDS.
Basic database knowledge in MySQL and PostgreSQL for health checks and simple queries.
Basic understanding of REST APIs, system integrations, and authentication design.
Familiarity with incident, problem, and change management processes.
Problem-Solving Skills:
Strong analytical and troubleshooting capabilities.
Ability to break down complex incidents into clear and actionable steps.
Calm, structured, and methodical when handling production issues under pressure.
Job ID: 145249145