Site Reliability Engineer

4-6 Years
SGD 10,000 - 12,000 per month
  • Posted a day ago

Job Description

Job Responsibilities

  • Design, build, and maintain CI/CD pipelines specifically tailored for machine learning models and applications, including automated testing, model versioning, and deployment strategies.
  • Implement and manage infrastructure as code (IaC) solutions using tools like Terraform or CloudFormation to provision and configure cloud resources (AWS, Azure, GCP) for machine learning workloads.
  • Collaborate with Data Scientists, Machine Learning Engineers, and Software Developers to understand their infrastructure needs and translate them into scalable and reliable solutions.
  • Monitor the performance and health of machine learning systems, establish alerting mechanisms, and troubleshoot production issues related to infrastructure, deployments, and model serving.
  • Optimize ML infrastructure for cost-efficiency, performance, and security, leveraging containerization (Docker, Kubernetes) and serverless technologies.
  • Develop and maintain documentation for DevOps processes, tools, and infrastructure configurations.
  • Promote best practices for security, reliability, and scalability within the machine learning development lifecycle.
  • Evaluate and integrate new technologies and tools to enhance the MLOps ecosystem.


Job Requirements

  • Bachelor's degree in Computer Science, Engineering, or a related technical field.
  • 4+ years of experience in a DevOps, SRE, or MLOps role.
  • Strong proficiency in at least one major cloud platform (AWS, Azure, or GCP), including experience with compute, storage, networking, and security services.
  • Extensive experience with CI/CD tools (e.g., Jenkins, GitLab CI, Azure DevOps, GitHub Actions).
  • Demonstrated expertise in containerization technologies (Docker) and orchestration platforms (Kubernetes).
  • Solid understanding of infrastructure as code (IaC) principles and tools (e.g., Terraform, CloudFormation, Ansible).
  • Proficiency in scripting languages such as Python or Bash.
  • Familiarity with machine learning concepts, MLOps principles, and experience deploying ML models into production.
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).

Job ID: 144739449
