Application Support Engineer (SRE)

Knovel Engineering Pte. Ltd.

Singapore, Ubi

3-6 Years

SGD 4,000 - 6,500 per month

Posted 2 months ago

Job Description

About Knovel

At our core, our passion is to craft novel AI and technology solutions that will shape tomorrow. We deploy cutting-edge technology that builds on cloud computing to proliferate AI, data and analytics solutions tailored to drive innovation and transform businesses.

With our desire to push the boundaries of technology, we partner closely with our clients. Guiding their transformation with agility, we apply a structured technology transformation process attuned to their unique challenges.

At Knovel Engineering, we blend technology with creativity to build unique solutions tailored for our customers.

About the Role

We are looking for a Senior Application Support Engineer to own the reliability of our AI-native ecosystem. You will lead the resolution of high-impact incidents, perform deep-dive root-cause analysis across cloud and containerized environments, and proactively optimize system performance through advanced observability. By automating diagnostic workflows and collaborating with core development teams, you will play a critical role in the formal commissioning and long-term stability of our production products.

Key Responsibilities

Act as the primary escalation point for complex technical incidents arising from our commissioned AI-native applications, platforms, and SaaS products, taking ownership of issues through to resolution.
Conduct deep-dive root-cause analysis for challenging production problems, systematically investigating across application code, cloud infrastructure, container orchestration, and ML model behaviour to identify underlying defects.
Serve as the technical authority and primary communicator for customers during high-impact incidents, providing clear, accurate, and timely updates while managing service level agreement (SLA) commitments.
Author, maintain, and improve operational runbooks, knowledge base articles, and diagnostic scripts to build a library of repeatable solutions and empower first-line support capabilities.
Collaborate directly with the core development teams to reproduce and diagnose root-cause defects, contributing detailed findings and logs to the reliability backlog for sprint prioritisation.
Proactively analyse system performance and observability data (metrics, logs, traces) to identify degradation trends, capacity bottlenecks, and potential failures before they impact customers.
Participate in the formal commissioning gate process, performing operational readiness reviews on system architecture documents, monitoring plans, and runbooks for newly delivered projects.
Participate in a shared on-call rotation to provide timely, expert response for critical after-hours incidents.
Mentor junior support staff, lead post-incident reviews, and drive continuous improvement initiatives within the Post-Delivery Engineering team (expected of senior engineers).

Qualifications

Degree or postgraduate degree in Information Systems, Computer Science, Computer Engineering, or a related technical field.
Demonstrable experience in a Site Reliability Engineering (SRE), Application Support, or technical operations role supporting complex, cloud-native software systems.
Cloud & Infrastructure: Hands-on proficiency with a major cloud platform (AWS or Azure), including core compute, storage, networking, and identity services (for example EC2, S3, VPC, and IAM on AWS) and managed container services (EKS, AKS). Strong understanding of cloud networking principles.
Containerisation: Deep experience with Docker and Kubernetes. Must be able to troubleshoot pod failures, inspect container logs, understand deployment manifests (YAML), and debug orchestration issues.
Observability & Monitoring: Practical experience using modern observability platforms (e.g., Datadog, Grafana, Prometheus, Loki). Ability to build dashboards, write complex queries, and interpret metrics, logs, and traces to pinpoint issues.
Automation & Scripting: Strong scripting skills in at least one language (Python, Go, or Bash) for writing diagnostic tools, automating repetitive tasks, and building remediation workflows.

Experience with, or good knowledge of, the following will be advantageous:

AI/ML Systems Support: Experience supporting machine learning systems in production, including troubleshooting ML inference APIs, monitoring data pipelines (e.g., Apache Airflow, Kubeflow), and understanding the failure modes of ML models.
Infrastructure-as-Code (IaC): Familiarity with tools like Terraform, Ansible, or Helm for managing cloud resources and application deployments.
Database Operations: Experience with SQL and NoSQL databases, including query analysis, performance tuning, and basic administration tasks.
CI/CD & DevOps: Understanding of continuous integration and continuous delivery pipelines and their role in the software development lifecycle.
Incident Management: Formal experience with incident management frameworks (e.g., ITIL) and conducting blameless post-incident reviews.

Why you should apply: