About the Role
We are looking for a Senior Software Engineer to join our ML Infrastructure: Dev Enablement team. Our mission is to build a frictionless development environment that empowers our researchers and engineers to innovate on deep learning models for autonomous driving.
We manage a high-scale Cloud Development Environment (CDE) platform that provides standardized, high-performance workspaces for ML development. In this role, your impact will be two-fold:
- Platform Ownership: You will act as a key owner of our CDE platform, ensuring its scalability, reliability, and seamless integration into the ML workflow.
- Agentic Evolution: You will lead our shift toward Agentic ML Workflows. You won't just be building static tools; you'll be architecting AI agents that act as force multipliers, helping engineers automate debugging, optimize resource usage, and accelerate the journey from code to a trained model.
What You'll Be Doing
- Scale & Evolve the Dev Platform: Lead engineering efforts to support and enhance our existing CDE platform, ensuring it meets the rigorous demands of large-scale ML experimentation.
- Architect AI Agents: Design and implement LLM-powered agents capable of navigating the ML lifecycle, from automated code suggestions and log analysis to autonomous debugging of distributed training jobs.
- Infrastructure Integration: Bridge the gap between AI agents and our core infra, ensuring agents can safely and effectively interact with Kubernetes, Ray, and AWS resources.
- Collaborative Automation: Partner with ML Engineers to identify productivity killers and build agentic solutions (e.g., an agent that suggests fixes for common PyTorch distributed training errors).
- Champion Engineering Excellence: Bring software engineering rigor to the wild west of LLM development, including building evaluation frameworks for agent performance, reliability, and security.
- Mentor & Lead: Act as a subject matter expert on Agentic AI within the infrastructure team, guiding junior engineers and influencing our long-term technical roadmap.
What We're Looking For
- Experience: 5+ years of professional software engineering experience, with a focus on backend systems, distributed systems, or infrastructure.
- Agentic AI Proficiency: Hands-on experience building applications with LLM frameworks (e.g., LangChain, LangGraph, or LlamaIndex). You understand how to turn a prompt into a reliable, tool-calling agent.
- Technical Stack: Expert-level Python or Go. Deep experience with Kubernetes and Cloud is required.
- Cloud Infrastructure: Proven experience with AWS (or a similar cloud provider).
- Communication: Ability to translate complex infrastructure challenges into clear technical designs and collaborate across diverse engineering and research teams.
Bonus Points
- ML Ecosystem: Experience with ML orchestration and training frameworks like Ray or PyTorch.
- Remote Dev Expertise: Familiarity with Coder or other Cloud Development Environments (CDEs) at scale.
- Experience managing or working with high-performance compute resources (GPUs).