Backend Engineer - AML Framework Development (Search, Ads, and Recommendation Direction)

3-5 Years
  • Posted 12 days ago

Job Description

About The Team

The mission of our AML team is to advance the next-generation AI infrastructure and recommendation platform for ads ranking, search ranking, and live and e-commerce ranking in our company. We also drive substantial impact on the company's core businesses.

Responsibilities

- Iterate on the underlying architecture of the large model inference engine and own end-to-end GPU performance optimization. Use techniques such as operator fusion and compilation optimization to deeply optimize GPU memory access, compute pipelines, and asynchronous stream scheduling, eliminating inference compute bottlenecks, improving single-card inference throughput, and reducing inference latency.
- Adapt the engine to the full range of GPU/NPU hardware architectures, improve the generality and hardware adaptability of the inference engine, and build a high-performance, low-loss foundation for large model inference.
- Lead the design, development, and optimization of distributed parallel solutions for large model inference, focusing on multi-dimensional parallel strategies such as tensor parallelism (TP), pipeline parallelism (PP), sequence parallelism, and MoE expert parallelism, to address core issues such as multi-card partitioning and deployment of ultra-large models, high cross-card communication overhead, load imbalance, and low parallel efficiency.
- Follow cutting-edge technologies worldwide in large model inference, GPU high-performance computing, distributed parallelism, and cache optimization; benchmark against mainstream inference frameworks such as vLLM and TensorRT-LLM; deliver solutions and technical innovation; continuously improve the performance and cost advantages of the inference system; and build the team's core technical barriers.

Qualifications

Minimum Qualification(s)

- Bachelor's degree in Computer Science or equivalent, with 3+ years of relevant experience.
- Solid foundation in low-level computer systems; proficient in C/C++ and Python; skilled in CUDA programming; familiar with GPU hardware architecture principles; and well-versed in GPU memory models, compute scheduling, and communication mechanisms.
- Proficient in the low-level development and implementation of basic deep learning operators; well-versed in GPU adaptation and optimization of core operators such as matrix operations, normalization, and activation functions; and able to independently complete hand-written operator rewrites, memory access optimization, vectorization, and precision alignment to ensure high-performance, stable operator inference.
- Familiar with the end-to-end deep learning inference compilation process; understands core compilation techniques such as computational graph optimization, operator fusion, constant folding, memory reuse, scheduling optimization, and quantization compilation; and able to simplify the inference process, reduce GPU memory usage, and lower inference latency through compilation-level improvements, significantly improving model inference throughput.
- Proficient with GPU performance analysis tools such as Nsight and Profiler; able to accurately identify performance bottlenecks such as wasted compute, memory access stalls, and redundant scheduling during inference; has a software-hardware co-optimization mindset; able to produce systematic optimization plans and carry them through implementation; and able to meet the demands of industrial-scale, high-concurrency, low-latency inference workloads.

Preferred Qualification(s)

- Thorough understanding of the core principles of large model inference and of model parallelism techniques; experience implementing distributed inference solutions such as tensor parallelism, pipeline parallelism, and sequence parallelism; and familiarity with multi-card communication, load balancing, and parallel efficiency optimization methods.
- Experience in secondary development and performance optimization of mainstream large model inference frameworks such as vLLM, SGLang, and TensorRT-LLM.
- Familiarity with model computation efficiency optimization solutions for mainstream deep learning frameworks.


About Company

ByteDance is a technology company operating a range of content platforms that inform, educate, entertain and inspire people across languages, cultures, and geographies.
Dedicated to building global platforms of creation and interaction, ByteDance now has a portfolio of applications available in over 150 markets and 75 languages, including TikTok, Helo, Vigo Video, Douyin, and Huoshan.

Job ID: 104356491
