Manage the end-to-end feedback loop for incidents, including rapid triage, effective resolution, and the facilitation of post-incident reviews to ensure closure and prevent recurrence.
Execute upgrades and deployments strictly adhering to SOPs, while actively leveraging Machine Learning and Infrastructure expertise to refine, automate, and improve these processes for greater efficiency.
Analyse user needs related to the machine learning systems provided by the Applied Machine Learning department, through on-call shifts or other mechanisms, and propose customer-oriented solutions.
Serve as the primary responder for the Machine Learning Production Platform, taking ownership of system availability, health monitoring, and immediate incident response to ensure high reliability.
Collaborate with software engineers to implement and deploy customer-oriented solutions related to machine learning frameworks.
Update software, enhance existing software capabilities, and develop or deploy procedures for software testing, deployment, capacity management, and validation.
Work with computer hardware engineers to integrate hardware and software systems, and to resolve issues with specifications and performance requirements.
Requirements:
Bachelor's degree in Computer Science or equivalent, with 3+ years of relevant experience.
Proven experience in analyzing and troubleshooting distributed systems.
Prior experience designing or maintaining large-scale systems.
Proficiency in at least one scripting language (such as Python, Go, or Shell/Bash) to automate repetitive operational tasks.
Good to have:
Experience in operating MLOps platforms and toolkits such as Kubeflow, MLflow, Feast, or Ray.
Deep understanding of Linux operating system internals, container technologies (Docker/containerd), and orchestration platforms (Kubernetes) in production environments.
Basic understanding of Machine Learning concepts and familiarity with frameworks such as TensorFlow Serving, TorchServe, or Triton Inference Server.
Experience defining and managing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets, and practicing Chaos Engineering.