Responsibilities:
Incident & Application Support
- Provide second-line (L2) support for production and staging systems, handling escalations from L1 Support.
- Investigate application errors, system alerts, performance degradation, and integration issues.
- Restore services within agreed SLA/OLA timelines and ensure proper incident closure.
Troubleshooting & Root Cause Analysis
- Perform in-depth troubleshooting using logs, metrics, and monitoring tools.
- Conduct root cause analysis (RCA) for recurring or high-impact incidents.
- Propose and implement corrective and preventive actions to reduce incident recurrence.
Collaboration & Escalation
- Work closely with L3 engineers, DevOps, and vendors to resolve complex technical issues.
- Provide clear technical findings, logs, and evidence when escalating issues.
- Participate in incident bridges, post-incident reviews, and operational discussions.
Operational Excellence
- Monitor system health, alerts, dashboards, and logs to proactively identify issues.
- Execute approved configuration changes, patches, and operational fixes.
- Support deployment, release, and maintenance activities when required.
Automation & Continuous Improvement
- Contribute to automation of operational tasks, monitoring, and alerting where applicable.
- Identify gaps in runbooks, SOPs, and operational processes and drive improvements.
Documentation
- Maintain and update runbooks, troubleshooting guides, and knowledge base articles.
- Document incident resolutions and operational procedures clearly and accurately.
Security & Compliance
- Adhere to security, access control, and compliance requirements.
- Handle sensitive information in logs, tickets, and systems appropriately.
- Support audits, vulnerability remediation, and compliance checks when required.
Key Experience and Qualifications:
Educational Background
- Diploma or higher in Computer Science, Information Technology, or a related field
Professional Experience
- 3 to 5+ years of experience in application support, systems support, or operations roles
- Experience supporting production systems in high-availability or mission-critical environments
Technical Expertise
- Strong hands-on experience with:
  - Application log analysis and monitoring tools (e.g., AWS CloudWatch, Grafana, ELK, Google Analytics)
  - Linux/Unix environments
- Working knowledge of cloud platforms (e.g., AWS services such as ECS, Lambda, S3, RDS)
- Basic database knowledge (MySQL, PostgreSQL) for health checks and simple queries
- Basic understanding of REST APIs, system integrations, and authentication design
- Familiarity with incident, problem, and change management processes
Problem-Solving Skills
- Strong analytical and troubleshooting abilities
- Ability to break down complex incidents into clear, actionable steps
- Calm and methodical approach when handling production issues under pressure
Operational Practices
- Familiarity with ticketing and incident management tools (e.g., Jira, PagerDuty)
- Experience working with runbooks, SOPs, and on-call support rotations (if applicable)
Additional Skills (Preferred / Bonus)
- Experience supporting cloud-native or microservices-based systems
- Basic scripting skills (e.g., Bash, Python) for automation
- Experience working in government, regulated, or large-scale enterprise environments
- Knowledge of disaster recovery and business continuity planning
Key Character Traits
- Team player with a collaborative mindset
- Strong sense of ownership and accountability for system reliability
- Proactive in identifying and resolving operational issues
- Willingness and ability to learn and adapt to new systems and tools
- Open to knowledge sharing and team capability improvement
- Clear verbal and written communication skills, including incident reporting