Responsibilities
ByteDance will prioritize applicants who have a current right to work in Singapore and do not require ByteDance's sponsorship of a visa.

Team Introduction
The mission of our AML team is to push next-generation machine learning algorithms and platforms for the recommendation system, ads ranking and search ranking in our company. We also drive substantial impact on core businesses of the company.

Responsibilities:
1. Resource Efficiency Optimization in Distributed Orchestration and Scheduling:
- Develop and extend distributed orchestration frameworks within the Kubernetes/Godel ecosystem. Select appropriate frameworks for different business scenarios, and optimize cluster utilization and load balancing strategies according to the specific characteristics of each scenario.
- Integrate and expand AutoScaling and automatic parallelization capabilities for various models and tasks. Employ load modeling and analytic methods for different models to automatically optimize resource requests, achieving large-scale improvements in resource usage efficiency and global optimality.
- Own preemption and rescheduling mechanisms for services with different priorities, manage automatic resource multiplexing across different clusters and resource types, and handle scheduling and load adaptation across multi-datacenter, multi-region, and multi-cloud environments.
2. Building Training System Architecture for Next-Generation Ultra-Large and Ultra-Deep Recommendation Models:
- Develop a flexible, elastic and robust distributed training runtime focused on hyper-scaled embeddings and large-scale GPU training.
- Design and optimize distributed computing APIs and runtimes geared towards future recommendation and ads model paradigms (e.g., reinforcement learning, fine-tuning and/or distillation).
- Collaborate with platform teams to enhance the diagnosability and usability of distributed training systems.
3. Constructing Online Orchestration Architecture for Next-Generation Recommendation Systems:
- Build a robust and stable distributed model inference architecture for online learning scenarios involving hyper-scaled embeddings.
- Optimize the usability of online recommendation and ads model architectures and MLOps workflows.
Qualifications
Minimum Qualifications
- Bachelor's degree or above, majoring in Computer Science, Engineering or related fields.
- Strong programming and coding experience with at least one modern language such as Golang or Python.
- Experience contributing to large-scale distributed systems and multi-tenant systems (architecture, reliability and scaling).
- Strong analytical and problem-solving abilities.
- Good communication, self-motivation, engineering practice, documentation, etc.
- At least 3 years of relevant experience.

Preferred Qualifications
- Familiar with large-scale distributed scheduling systems like Kubernetes, Yarn, Flink and/or Spark.
- Familiar with open-source orchestration frameworks like VeRL, vLLM, Ray or TFX, etc.