Site Reliability Engineer

3-5 Years
SGD 6,500 - 8,500 per month
  • Posted 5 days ago

Job Description

Job Responsibilities:

  • Manage the end-to-end feedback loop for incidents, including rapid triage, effective resolution, and the facilitation of post-incident reviews to ensure closure and prevent recurrence.
  • Execute upgrades and deployments in strict adherence to SOPs, while applying machine learning and infrastructure expertise to refine, automate, and improve these processes for greater efficiency.
  • Analyse user needs related to the machine learning systems provided by the Applied Machine Learning department, through on-call shifts or other mechanisms, and propose customer-oriented solutions.
  • Serve as the primary responder for the Machine Learning Production Platform, taking ownership of system availability, health monitoring, and immediate incident response to ensure high reliability.
  • Collaborate with software engineers to implement and deploy customer-oriented solutions related to machine learning frameworks.
  • Update software, enhance existing software capabilities, and develop or deploy software testing, deployment, capacity management, and validation procedures.
  • Work with computer hardware engineers to integrate hardware and software systems, and troubleshoot specifications and performance requirements.


Requirements:

  • Bachelor's Degree in Computer Science or equivalent, with 3+ years of relevant experience.
  • Proven experience in analyzing and troubleshooting distributed systems.
  • Prior experience designing or maintaining large-scale systems.
  • Proficiency in at least one scripting language (such as Python, Go, or Shell/Bash) to automate repetitive operational tasks.


Good to have:

  • Experience in operating MLOps platforms and toolkits such as Kubeflow, MLflow, Feast, or Ray.
  • Deep understanding of Linux operating system internals, container technologies (Docker/containerd), and orchestration platforms (Kubernetes) in production environments.
  • Basic understanding of Machine Learning concepts and familiarity with frameworks such as TensorFlow Serving, TorchServe, or Triton Inference Server.
  • Experience defining and managing Service Level Indicators (SLIs), Service Level Objectives (SLOs), error budgets, and practicing Chaos Engineering.

More Info

Job ID: 143741373
