Infrastructure Engineer (GPU / Kubernetes / Distributed Systems)

kaishi partners pte. ltd.

Singapore

4-7 Years

SGD 12,000 - 24,000 per month

Posted 19 days ago

Job Description

We're working with a high-growth AI infrastructure company building the foundational systems that power next-generation AI products and intelligent search.

The team is building a search engine designed for AI agents. They operate large-scale distributed systems that crawl the web, train state-of-the-art embedding models, and power high-performance vector search. On the compute side, they run a rapidly growing multi-million-dollar H200 GPU cluster alongside distributed batch processing systems spanning tens of thousands of machines.

This is a deeply technical infrastructure role focused on building the internal platform and tooling that lets the entire engineering organization move fast at scale.

What You'll Work On

Build and scale Kubernetes orchestration for large GPU clusters
Design distributed infrastructure powering large-scale AI workloads
Scale cloud batch job systems handling map-reduce workloads across tens of thousands of machines
Improve GPU scheduling and cluster utilization efficiency
Build observability, reliability, and internal platform tooling for production systems
Work on infrastructure supporting AI training, inference, crawling, and data processing at massive scale

What We're Looking For

Experience designing and operating large-scale infrastructure systems
Strong hands-on experience with Kubernetes in production environments
Familiarity with GPU clusters, distributed compute, or cloud batch processing systems
Strong understanding of observability, reliability engineering, and system optimization
Experience with distributed systems and performance-oriented infrastructure
Background in high-performance engineering environments (highly valued)

Nice to Have

Experience with Ray, distributed batch systems, or large-scale orchestration platforms
Experience optimizing GPU utilization and scheduling
Familiarity with AWS infrastructure at scale
Exposure to AI/ML infrastructure environments

Why This Role