Site Reliability Engineer

4-6 Years
SGD 10,000 - 12,000 per month

This job is no longer accepting applications

  • Posted a month ago

Job Description

Job Responsibilities

  • Design, build, and maintain CI/CD pipelines specifically tailored for machine learning models and applications, including automated testing, model versioning, and deployment strategies.
  • Implement and manage infrastructure as code (IaC) solutions using tools like Terraform or CloudFormation to provision and configure cloud resources (AWS, Azure, GCP) for Machine Learning workloads.
  • Collaborate with Data Scientists, Machine Learning Engineers, and Software Developers to understand their infrastructure needs and translate them into scalable and reliable solutions.
  • Monitor the performance and health of Machine Learning systems, establish alerting mechanisms, and troubleshoot production issues related to infrastructure, deployments, and model serving.
  • Optimize ML infrastructure for cost-efficiency, performance, and security, leveraging containerization (Docker, Kubernetes) and serverless technologies.
  • Develop and maintain documentation for DevOps processes, tools, and infrastructure configurations.
  • Promote best practices for security, reliability, and scalability within the Machine Learning development lifecycle.
  • Evaluate and integrate new technologies and tools to enhance the Machine Learning DevOps ecosystem.


Job Requirements

  • Bachelor's degree in Computer Science, Engineering, or a related technical field.
  • 4+ years of experience in a DevOps, SRE, or Machine Learning Ops role.
  • Strong proficiency in at least one major cloud platform (AWS, Azure, or GCP), including experience with compute, storage, networking, and security services.
  • Extensive experience with CI/CD tools (e.g., Jenkins, GitLab CI, Azure DevOps, GitHub Actions).
  • Demonstrated expertise in containerization technologies (Docker) and orchestration platforms (Kubernetes).
  • Solid understanding of infrastructure as code (IaC) principles and tools (e.g., Terraform, CloudFormation, Ansible).
  • Proficiency in scripting languages such as Python or Bash.
  • Familiarity with machine learning concepts, MLOps principles, and experience deploying ML models into production.
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).

More Info

Job ID: 144739449
