
About the Role
This role builds the foundation for production-grade distributed AI training at scale. You will design reusable training recipes, benchmarking frameworks, and evaluation standards that enable large customers to train and compare models efficiently across multi-node GPU clusters.
You'll work closely with platform, orchestration, and application engineers to turn distributed training best practices into repeatable, customer-facing templates.
Job Details
Job Requirements
Job ID: 147388473
Skills:
PyTorch, DeepSpeed, Megatron-LM, NCCL, FSDP, tensor pipeline