Role & Responsibilities
- Architect and deploy scalable AI infrastructure for LLMs and multi-modal models, covering data pipelines, model serving, and inference optimization on Kubernetes/AKS/EKS.
- Design GPU/NPU-aware cluster topologies with auto-scaling, model checkpointing, and low-latency inference SLAs for production workloads.
- Integrate MLOps toolchains (MLflow, Weights & Biases, KServe, Triton) with CI/CD pipelines to automate model deployment, rollback, and drift detection.
- Establish infrastructure-as-code (IaC) standards using Terraform or Pulumi for reproducible, secure, and auditable cloud environments.
- Collaborate with ML Engineers and Data Scientists to optimize model quantization, batching strategies, and memory footprint for production efficiency.
- Define SRE practices, including observability (Prometheus/Grafana), alerting, and disaster recovery, and enforce infrastructure security (IAM, network policies, Pod Security Standards).
Skills & Qualifications
Must-Have
- Kubernetes
- Terraform
- GPU orchestration
- Model serving (Triton, KServe, Seldon)
- MLflow
- Prometheus
- Grafana
- AKS/EKS
Preferred
- Ray Serve
- ONNX Runtime
- OpenTelemetry
Benefits & Culture Highlights
- Work on bleeding-edge AI infrastructure used by Fortune 500 clients and scaling AI startups.
- On-site collaborative environment in Singapore's innovation hub with cross-functional AI & cloud teams.
- Unlimited PTO, performance bonuses, and an annual learning stipend for certifications and AI conferences.
Skills: networking, infrastructure, teams, architecture, platforms, kubernetes, design, storage, data center, orchestration, cloud