Overview::
We are seeking a hands-on Senior Cloud DataOps & MLOps Engineer to help build and operate a modern AWS-based monitoring and machine learning platform. In this role, you will connect data operations, machine learning, and cloud infrastructure to enable scalable, observable, and reliable systems.
Key focus areas include::
- Setting up and managing Amazon SageMaker environments
- Implementing centralized monitoring and observability using Prometheus and Grafana
- Building logging and metrics pipelines across AWS applications and services
- Supporting cloud platform automation, operational monitoring, and MLOps enablement
The ideal candidate will have a strong cloud engineering foundation and practical experience operating enterprise-scale AWS environments.
Key Responsibilities::
- Design and implement AWS cloud infrastructure and operational platforms
- Configure and manage Amazon SageMaker environments to support ML workflows
- Set up centralized monitoring, metrics collection, and alerting using Prometheus and Grafana
- Build logging and observability pipelines across applications and AWS services
- Develop dashboards, alerts, and operational monitoring standards
- Support platform automation using Infrastructure-as-Code (IaC) and CI/CD pipelines
- Work closely with application, data, and platform teams to improve reliability and operational visibility
- Troubleshoot production issues and optimize cloud platform performance
Required Skills & Experience::
Mandatory::
- Strong hands-on experience with AWS cloud services including:
1. IAM, VPC, EC2, S3, CloudWatch
2. ECS or EKS
3. Security and networking fundamentals
- Prometheus and Grafana
- Monitoring and observability platforms
- Centralized logging architectures
- Experience with Amazon SageMaker or other MLOps platforms
- Proficiency with Terraform, CloudFormation, or similar IaC tools
- Scripting experience in Python or Bash
- Experience supporting production cloud environments
Preferred::
- Kubernetes / EKS experience
- OpenTelemetry experience
- CI/CD pipeline implementation
- DataOps or platform engineering background
- AWS certifications (e.g., Solutions Architect, DevOps Engineer, or Machine Learning Specialty)
Ideal Candidate Profile::
We are looking for someone who is:
- Hands-on and technically strong – comfortable diving into infrastructure and code
- Enterprise-ready – experienced in navigating complex cloud environments
- A strong troubleshooter – able to diagnose and resolve production issues efficiently
- Proactive and independent – takes ownership and drives tasks to completion
- A clear communicator – able to collaborate effectively with technical teams and business stakeholders
Technology Environment::
AWS | SageMaker | Prometheus | Grafana | CloudWatch | Terraform | Python | Kubernetes | Docker | CI/CD
Nice-to-Have::
- Experience supporting AI/ML platforms in production
- Experience with enterprise observability frameworks (e.g., OpenTelemetry, Jaeger)
- Multi-account AWS environment experience (e.g., AWS Organizations, Control Tower)
- FinOps or cloud governance exposure (e.g., AWS Cost Explorer, Budgets, tagging strategies)