Design, build, and maintain CI/CD pipelines specifically tailored for machine learning models and applications, including automated testing, model versioning, and deployment strategies.
Implement and manage infrastructure as code (IaC) using tools such as Terraform or CloudFormation to provision and configure cloud resources (AWS, Azure, GCP) for machine learning workloads.
Collaborate with Data Scientists, Machine Learning Engineers, and Software Developers to understand their infrastructure needs and translate them into scalable and reliable solutions.
Monitor the performance and health of machine learning systems, set up alerting, and troubleshoot production issues related to infrastructure, deployments, and model serving.
Optimize ML infrastructure for cost-efficiency, performance, and security, leveraging containerization (Docker, Kubernetes) and serverless technologies.
Develop and maintain documentation for DevOps processes, tools, and infrastructure configurations.
Promote best practices for security, reliability, and scalability throughout the machine learning development lifecycle.
Evaluate and integrate new technologies and tools to enhance the machine learning DevOps ecosystem.
Job Requirements
Bachelor's degree in Computer Science, Engineering, or a related technical field.
4+ years of experience in a DevOps, SRE, or MLOps role.
Strong proficiency in at least one major cloud platform (AWS, Azure, or GCP), including experience with compute, storage, networking, and security services.