System Development & Maintenance
Contribute to the development, optimization, and maintenance of core components of the machine learning platform, including feature stores, experiment tracking systems, model registries, workflow orchestration, and serving frameworks
Training Efficiency Optimization
Assist in optimizing the performance of distributed training frameworks (e.g., PyTorch DDP, DeepSpeed, FSDP) on large-scale clusters, addressing challenges such as resource scheduling and communication bottlenecks
Inference Performance Optimization
Participate in model deployment and serving, including performance profiling and acceleration through model compilation (e.g., TVM, TensorRT), operator optimization, computation graph optimization, and batching strategies
Infrastructure Support
Leverage technologies such as containerization (Docker), orchestration (Kubernetes), and monitoring (Prometheus/Grafana) to improve observability, reliability, and resource utilization of ML systems
Tooling & Developer Productivity
Build and maintain internal tools to improve engineering efficiency, such as automated evaluation systems, stress testing tools, and debugging utilities
Qualifications
Education
Bachelor's degree or above in Computer Science, Software Engineering, Electronic Engineering, or related fields
Fundamental Knowledge
Solid foundation in computer science fundamentals: operating systems, computer networks, data structures, and algorithms
Strong programming skills, with proficiency in Python; experience with Go or C++ is a strong plus
Basic understanding of software engineering principles, including design patterns and clean coding practices
Technical Skills
Familiarity with Linux development environments, including common commands and shell scripting
Experience with at least one mainstream deep learning framework (preferably PyTorch), with curiosity about its underlying mechanisms
Basic hands-on experience with containerization (Docker), CI/CD pipelines, and version control (Git)
Soft Skills
Strong passion for engineering and building high-performance, highly available systems
Excellent problem-solving and debugging skills, with a mindset for optimization
Good communication and teamwork skills, with the ability to collaborate effectively with cross-functional teams
Strong curiosity and willingness to deeply understand machine learning algorithms and how they integrate with systems engineering
Preferred Qualifications (Nice to Have)
Familiarity with Kubernetes and cloud-native technologies
Experience with model serving frameworks such as Triton, TensorFlow Serving, or TorchServe
Understanding of compiler fundamentals (e.g., LLVM), high-performance computing (HPC), or hardware acceleration (GPU/ASIC)
Contributions to open-source projects or relevant system/infrastructure projects on GitHub
Experience with large-scale data processing (e.g., Spark, Flink) or storage systems