ByteDance

Site Reliability Engineer - ARK Large Model Platform (Singapore)

Early Applicant
  • Posted 26 days ago
  • Be among the first 10 applicants
5-7 Years

IT/Computers - Software

Job Description

Responsibilities

ByteDance will be prioritizing applicants who have a current right to work in Singapore and do not require ByteDance's sponsorship of a visa.

About the Team

The Applied Machine Learning (AML) - Enterprise team provides machine learning platform products on Volcano Engine. These include a cloud-native resource scheduling system that intelligently orchestrates tasks and jobs to minimise the cost of every experiment and maximise resource utilisation; rich modelling tools, including customised machine learning tasks and a web IDE; and multi-framework, high-performance model inference services. In 2021, through Volcano Engine, we released this machine learning infrastructure to the public to give more enterprises lower computing costs, lower barriers to machine learning engineering and deeper development of AI capabilities.

The team is responsible for Ark Large Model Platform development on Volcano Engine: researching systematic solutions for implementing and applying large models in various industries, striving to reduce the IT cost of large model applications, meeting users' ever-growing demand for intelligent interaction, and improving the lifestyle and communications of users in the future world.

- Manage and oversee the stability of both control and data aspects of large-scale model systems through effective DevOps practices.
- Develop and enhance observability systems for monitoring the stability of large model systems, ensuring high reliability and performance.
- Handle super large-scale cluster management and ensure efficient operation and maintenance of large model systems.

Qualifications

Minimum Qualifications:
- B.Sc. or higher degree in Computer Science or related fields from accredited and reputable institutions.
- Minimum of 5 years of R&D experience in the fields of cloud computing or large-scale model systems.
- Proficiency in cloud-native technologies and understanding of the relevant technology stack.
- Expertise in one of the following programming languages: Golang, Python, or Java, with the ability to use it proficiently in a professional setting.
- Familiarity with cloud-native technologies for log collection, monitoring, and alerting.

Preferred Qualifications:
- Prior experience in the construction and maintenance of stability systems for large-scale infrastructures.
- Experience in operating and maintaining large-scale systems.
- Experience with infrastructure as code, particularly Terraform, is highly desirable.

More Info

Date Posted: 04/09/2025

Job ID: 125464309

About Company

ByteDance is a technology company operating a range of content platforms that inform, educate, entertain and inspire people across languages, cultures, and geographies.
Dedicated to building global platforms of creation and interaction, ByteDance now has a portfolio of applications available in over 150 markets and 75 languages, including TikTok, Helo, Vigo Video, Douyin, and Huoshan.

Last Updated: 30-09-2025 04:16:39 PM