Machine Learning Systems Engineer (MLSys)

Fresher

SGD 6,000 - 9,000 per month

Job Description

System Development & Maintenance
Contribute to the development, optimization, and maintenance of core components of the machine learning platform, including feature stores, experiment tracking systems, model registries, workflow orchestration, and serving frameworks
Training Efficiency Optimization
Assist in optimizing the performance of distributed training frameworks (e.g., PyTorch DDP, DeepSpeed, FSDP) on large-scale clusters, addressing challenges such as resource scheduling and communication bottlenecks
Inference Performance Optimization
Participate in model deployment and serving, including performance profiling and acceleration through model compilation (e.g., TVM, TensorRT), operator optimization, computation graph optimization, and batching strategies
Infrastructure Support
Leverage technologies such as containerization (Docker), orchestration (Kubernetes), and monitoring (Prometheus/Grafana) to improve observability, reliability, and resource utilization of ML systems
Tooling & Developer Productivity
Build and maintain internal tools to improve engineering efficiency, such as automated evaluation systems, stress testing tools, and debugging utilities

Requirements

Bachelor's degree or above in Computer Science, Software Engineering, Electronic Engineering, or related fields

Solid foundation in computer science fundamentals: operating systems, computer networks, data structures, and algorithms
Strong programming skills, with proficiency in Python; experience with Go or C++ is a strong plus
Basic understanding of software engineering principles, including design patterns and clean coding practices

Familiarity with Linux development environments, including common commands and shell scripting
Experience with at least one mainstream deep learning framework (preferably PyTorch), with curiosity about its underlying mechanisms
Basic hands-on experience with containerization (Docker), CI/CD pipelines, and version control (Git)

Strong passion for engineering and building high-performance, highly available systems
Excellent problem-solving and debugging skills, with a mindset for optimization
Good communication and teamwork skills, able to collaborate effectively with cross-functional teams
Strong curiosity and willingness to deeply understand machine learning algorithms and their integration with systems engineering

Preferred Qualifications

Familiarity with Kubernetes and cloud-native technologies
Experience with model serving frameworks such as Triton, TensorFlow Serving, or TorchServe
Understanding of compiler fundamentals (e.g., LLVM), high-performance computing (HPC), or hardware acceleration (GPU/ASIC)
Contributions to open-source projects or relevant system/infrastructure projects on GitHub
Experience with large-scale data processing (e.g., Spark, Flink) or storage systems