Overview
We are seeking an experienced Senior GenAI Platform Engineer / OpenShift SME to lead and manage enterprise-scale infrastructure supporting GenAI applications. This role focuses on OpenShift platform engineering, hybrid cloud environments, disaster recovery (DR), and security for highly scalable and resilient AI platforms.
Requirements
- 10+ years of experience in infrastructure or platform engineering.
- Strong expertise in managing OpenShift (OCP) in enterprise production environments.
- Hands-on experience in infrastructure sizing, capacity planning, and performance tuning for AI workloads.
- Experience supporting Oracle Database from an infrastructure/application standpoint.
- Strong knowledge of certificate management, secrets handling, and key management.
- Experience with CI/CD pipelines and infrastructure automation.
- Solid background in security, vulnerability management, and compliance.
- Proven experience in designing and implementing Disaster Recovery (DR) solutions.
- Experience with AWS cloud services and hybrid cloud environments.
- Strong experience with Docker and Kubernetes.
- Excellent coordination and stakeholder management skills across cross-functional teams.
Key Responsibilities
- Lead and manage end-to-end infrastructure for enterprise GenAI applications hosted on OpenShift (OCP).
- Own capacity planning, sizing, and performance optimization of OpenShift clusters and related infrastructure components.
- Manage and optimize infrastructure including Oracle DB, Redis, Elastic DB, PostgreSQL, Dell ECS storage, and Linux environments (Red Hat/Ubuntu).
- Design and implement Disaster Recovery (DR) strategies ensuring high availability, resilience, and business continuity.
- Lead end-to-end DR setup including replication, failover, testing, and documentation in collaboration with infrastructure and network teams.
- Manage certificate lifecycle (TLS/SSL), secrets, and key management across platforms.
- Implement vulnerability management, patching, and remediation across Kubernetes, containers, and infrastructure.
- Support and coordinate penetration testing and address security findings.
- Work with AWS services (EC2, VPC, CloudWatch, Lambda, Bedrock) in hybrid cloud environments.
- Build and maintain infrastructure automation using Terraform and CloudFormation.
- Manage observability through monitoring, logging, and alerting tools, and maintain Control-M job scheduling.
- Collaborate with DevOps, Security, and Development teams for platform reliability and performance.
- (Bonus) Work with or support open-weight LLMs for AI/ML use cases.