Site Reliability Engineer, Machine Learning Operations

3-5 Years
SGD 7,500 - 8,500 per month
  • Posted a day ago

Job Description

Purpose of Role:

  • Frontline On-Call Ownership: Serve as the primary responder for the Applied Machine Learning Engine, taking ownership of system availability, health monitoring, and immediate incident response to ensure high reliability.
  • Incident Lifecycle Management: Manage the end-to-end feedback loop for incidents, including rapid triage, effective resolution, and the facilitation of post-incident reviews to ensure closure and prevent recurrence.
  • SOP Execution & Optimization: Execute upgrades and deployments strictly adhering to Standard Operating Procedures (SOPs), while actively leveraging Machine Learning and Infrastructure expertise to refine, automate, and improve these processes for greater efficiency.

Responsibilities:

  • Analyse user needs related to the machine learning systems provided by the AML department, whether raised through on-call shifts or other channels, and propose customer-oriented solutions.
  • Work with other software engineers to implement and deploy customer-oriented solutions for machine learning frameworks, whether proposed by yourself or by others.
  • Update software, enhance existing software capabilities, and develop or deploy procedures for software testing, deployment, capacity management, and validation.
  • Work with computer hardware engineers to integrate hardware and software systems and to troubleshoot issues against specifications and performance requirements.

Minimum requirements:

  • Bachelor's degree in Computer Science or equivalent, with 3+ years of relevant experience.
  • Proven experience in analyzing and troubleshooting distributed systems.
  • Prior experience designing or maintaining large-scale systems.
  • Scripting skills in at least one major language (Python, Go, or Shell/Bash) to automate repetitive operational tasks.

Nice to have:

  • Experience defining and managing Service Level Indicators (SLIs), Service Level Objectives (SLOs), error budgets, and practicing Chaos Engineering.
  • Experience operating MLOps platforms and toolkits such as Kubeflow, MLflow, Feast, or Ray.
  • Deep understanding of Linux operating system internals or container technologies (Docker/containerd) and orchestration platforms (Kubernetes) in a production environment.
  • Basic understanding of Machine Learning concepts and familiarity with model-serving frameworks like TensorFlow Serving, TorchServe, or Triton Inference Server.

More Info

Job ID: 143914881