Job Responsibilities
1. Distributed Training Engineering
- Participate in the implementation of large-scale distributed training solutions.
- Lead the engineering deployment of data parallelism, model parallelism (TP/PP), and ZeRO optimization.
- Continuously tune GPU compute utilization and ensure the stability of ultra-large-scale training jobs.
2. Compute Scheduling Optimization
- Contribute in depth to the development and optimization of AI task scheduling logic.
- Implement fine-grained resource management, automatic fault recovery (self-healing), and efficient checkpointing mechanisms.
- Solve compute bottlenecks in complex gaming scenarios.
3. End-to-End Model Engineering
- Own the full pipeline from model training to inference deployment.
- Participate in operator performance profiling, model quantization, and high-performance inference pipeline construction.
- Support rapid iteration of AI features in the gaming business.
4. AI-Driven Engineering Evolution
- Actively adopt AI coding tools to improve development efficiency.
- Drive Harness Engineering practices through automated testing and engineering governance.
- Ensure the highest level of reliability for the underlying infrastructure.
Requirements
Education & Background
- Bachelor's degree or above in Computer Science, Systems Architecture, High-Performance Computing, or a related field.
Technical Skills
- Proficiency in at least one of Python, C++, or Go.
- Deep understanding of the PyTorch framework.
- Hands-on experience with mainstream distributed training technologies such as DeepSpeed and Megatron-LM.
Foundational Knowledge
- Understanding of distributed systems principles.
- Familiarity with the NCCL communication library, RDMA networking, or high-performance storage is a plus.
- Knowledge of containerization infrastructure such as Docker and Kubernetes.
Engineering Experience
- Extensive experience with AI coding tools (e.g., GitHub Copilot, Cursor) is a strong plus.
- Experience with Harness Engineering (engineering governance, automated benchmarking, system stress testing) is a strong plus.
General Qualifications
- Exceptional learning ability and clear logical thinking.
- Ability to collaborate efficiently with the team to solve complex systems engineering problems.
- Strong ability to read technical documentation in English.
Nice to Have
- Experience in high-performance backend architecture development.
- Hands-on project experience in LLM training/inference engineering.