
What the Role Entails
- Design and implement high-performance orchestration systems to automate the deployment, scaling, and lifecycle management of large-scale model training and real-time inference services
- Develop scheduling algorithms to optimize the utilization of global GPU/CPU clusters. Focus on improving resource allocation efficiency and managing the complexities of heterogeneous hardware
- Contribute to training-related features, such as incremental training and APIs for online learning, and improve framework efficiency
- Participate in developing pipelines for offline-to-online synchronization. Ensure strict data consistency and minimize model-update latency so that the most current intelligence is served
- Collaborate with cross-functional teams to ensure platform features are available, secure, and compliant for a global user base
Who We Look For
- Currently enrolled in an undergraduate or graduate degree program in Computer Science, Information Systems, or related fields
- Hands-on experience with machine learning systems and open-source ML orchestration frameworks (e.g., Ray, TFX, Kubeflow)
- Proficiency in backend software design, development, and deployment practices with at least one of the following programming languages: Golang, Python, C++, or Java
- In-depth knowledge of distributed system principles (e.g., consistency protocols like Paxos or Raft, distributed locking, and caching strategies)
- Familiarity with Linux operating system and common system tools
- Strong sense of ownership, customer-oriented values, and demonstrated integrity
- Good programming discipline, fast-learning ability, and teamwork skills
- Prior work or internship experience in the internet industry is a plus
- Fluency in both English and Mandarin Chinese for effective communication with international stakeholders
Job ID: 150707815