Performance Profiling & Optimization: Use profiling tools (e.g., NVIDIA Nsight, PyTorch Profiler) to identify bottlenecks in data loading, gradient computation, and communication; implement optimizations such as kernel fusion, sharding, and tiling to improve step time.
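To illustrate the kind of profiling work described above, a minimal PyTorch Profiler session might look like the following sketch (the model, shapes, and iteration count are placeholders, not part of this posting):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder workload: a single linear layer standing in for a real model.
model = torch.nn.Linear(512, 512)
x = torch.randn(32, 512)

# Record CPU-side operator timings across a few training-style steps;
# on a GPU box you would add ProfilerActivity.CUDA to the activities list.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(5):
        loss = model(x).sum()
        loss.backward()

# Aggregate per-operator stats and sort by total time to surface hotspots.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)
```

The printed table ranks operators (e.g., `aten::addmm` for the linear layer's matmul) by time, which is typically the first step in deciding whether data loading, compute, or communication dominates a step.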
Distributed Training: Optimize distributed training pipelines using frameworks such as PyTorch Distributed.
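As a sketch of the distributed-training setup implied above, the snippet below wraps a model in `DistributedDataParallel` using a single-process `gloo` group so it runs anywhere; real multi-GPU jobs are launched via `torchrun` with `nccl`, and the model, sizes, and learning rate here are placeholders:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def one_ddp_step() -> float:
    # Single-process process group for illustration only; torchrun normally
    # sets these environment variables and spawns one process per GPU.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

    model = DDP(torch.nn.Linear(8, 8))  # placeholder model
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    loss = model(torch.randn(4, 8)).sum()
    loss.backward()  # DDP all-reduces gradients across ranks here
    opt.step()

    dist.destroy_process_group()
    return loss.item()

result = one_ddp_step()
```

Optimizing such pipelines usually means overlapping the gradient all-reduce with backward compute (DDP's bucketing does this by default) and tuning bucket sizes and communication backends.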
Kernel Development: Design and maintain high-performance GPU kernels in Triton or CUDA for state-of-the-art ML workloads.
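For a flavor of the kernel work described above, here is the canonical Triton vector-add kernel: each program instance loads one tile, masks the ragged tail, and writes the sum back. The block size and the `add` wrapper are illustrative choices, and launching the kernel requires a GPU:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide tile of the input.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the final, possibly partial tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Launch the kernel over a 1D grid covering all elements (GPU only)."""
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024-element tile
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Production kernels extend the same pattern (tiling, masking, explicit loads/stores) to fused attention, normalization, and other memory-bound workloads where avoiding extra kernel launches and global-memory round trips matters.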
Data Pipeline Engineering: Build and optimize robust data-loading pipelines that maximize training throughput.
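The throughput levers alluded to above mostly live on PyTorch's `DataLoader`; the sketch below shows the common knobs, with the dataset, batch size, and worker count chosen purely for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder in-memory dataset standing in for real preprocessed shards.
ds = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

loader = DataLoader(
    ds,
    batch_size=128,
    num_workers=2,                          # parallel CPU-side decoding/augmentation
    pin_memory=torch.cuda.is_available(),   # pinned host memory enables async H2D copies
    persistent_workers=True,                # avoid re-forking workers every epoch
    prefetch_factor=2,                      # batches each worker keeps in flight
)

# Drain one epoch; in a real job the GPU step would overlap with prefetching.
n_seen = sum(xb.shape[0] for xb, _ in loader)
```

The usual tuning loop is to profile the input pipeline first, then raise `num_workers` and `prefetch_factor` until the accelerator is no longer starved between steps.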
What we're looking for:
Education: Bachelor's, Master's, or doctoral degree in Computer Science, Computer Engineering, or a related technical discipline.
Software Engineering: Strong proficiency in Python.
ML Frameworks: Extensive hands-on experience with PyTorch.
ML Knowledge: Experience optimizing model execution for both training and inference, along with a strong understanding of fundamental machine learning concepts, architectures, and workflows.
Problem Solving: Exceptional analytical and problem-solving skills, with a bias for action and a data-driven approach to technical challenges.