The Lead - Product Operations is responsible for ensuring that DRPL's AI-driven video understanding and video modification systems run smoothly across both VoD and Live environments. This is a hands-on, highly technical role that blends dataset preparation, AI model training support, software testing, production monitoring, and DevOps practices. The role requires someone who can bridge the gap between engineering and operations, ensuring the health of our AI pipelines and infrastructure on a daily basis.
You will own operational excellence for production systems, ensuring data readiness, pipeline reliability, and proactive issue resolution across both cloud and on-premise (edge) deployments.
. Prepare, curate, and validate datasets for AI model training.
. Coordinate with AI engineers to ensure model training pipelines run efficiently.
. Validate AI model outputs against quality benchmarks before deployment.
. Ensure end-to-end health of AI pipelines (data ingestion, processing, inference, output delivery).
. Monitor system uptime and performance using observability tools.
. Configure alerts and escalation workflows for critical failures or performance degradation.
. Maintain deployment documentation, runbooks, and standard operating procedures.
. Deploy and maintain AI and software components on cloud platforms as well as bare-metal/edge systems.
. Install and configure operating systems, dependencies, and runtime environments.
. Automate deployment and monitoring where possible.
. Conduct integration and functional testing for new releases of AI models and platform features.
. Work with QA engineers to create and maintain test plans for AI workflows.
. Work closely with Engineering Leads and Architects to implement operational improvements.
. Coordinate with Product Managers to align operational tasks with business priorities.
. Strong technical background with hands-on experience in Linux systems administration, shell scripting, and automation tools.
. Knowledge of DevOps tools (Docker, Kubernetes, CI/CD systems, monitoring tools like Prometheus/Grafana).
. Understanding of AI/ML workflows including dataset preparation, training pipelines, and inference systems.
. Experience with cloud services (AWS, GCP, Azure) and bare-metal/edge deployments.
. Familiarity with software testing practices and QA methodologies.
. Ability to troubleshoot software, model, and system issues across the stack.
. Strong analytical and problem-solving skills; proactive in identifying and resolving operational issues.
. Comfortable working with both structured and unstructured data.
. Exposure to video processing or AI-based media workflows.
. Experience with GPU-accelerated systems and drivers.
. Familiarity with database systems and message queues (PostgreSQL, MongoDB, Kafka, Redis, etc.).
. Scripting in Python for automation and data manipulation.
Date Posted: 25/08/2025
Job ID: 124642949