Responsibilities
Team Introduction
The Global Traffic Infrastructure (GTI) team leverages unified platform capabilities to manage edge infrastructure outside China (both self-built and third-party), providing standardized, compliant, scalable, and cost-effective traffic infrastructure capabilities for edge services. Our vision is to build a global edge traffic infrastructure platform and become the long-term cornerstone of ByteDance's global edge business in terms of scale, performance, and cost.

Responsibilities
- SLO/SLI & Error Budget: Align with business stability goals; own the overall SLO strategy and execution for the platform; build and operate an SLO/SLI & Error Budget framework covering critical user journeys/services.
- Release & Change Governance: Drive end-to-end release/change management across code, configuration, network, and capacity; establish standardized change reviews, canary/phased rollout strategies, rollback mechanisms, and release window governance.
- Incident Management & On-call: Participate in a global 24x7 follow-the-sun on-call model; unify incident processes (triage, response, escalation, communication) to reduce blast radius and recovery time.
- Postmortems & Stability Programs: Participate in major incident postmortems; drive cross-team stability programs (e.g., chaos engineering, capacity stress testing, SPOF elimination); distill reusable best practices.
- Design for Operability: Partner closely with platform engineering and network/infrastructure teams to shift operability and reliability requirements left into architectural design and development workflows.
Qualifications
Minimum Qualifications
- At least 3 years of experience in SRE/DevOps/Production Engineering/Infrastructure Backend roles, supporting large-scale online systems.
- SRE Methodology: Strong grasp of SLO/SLI, Error Budget, incident management, and postmortems, with proven experience adopting them in production for large-scale online systems.
- CI/CD & Progressive Delivery: Deep understanding of CI/CD pipelines and deployment patterns such as blue-green, canary, phased rollout, configuration management, and feature flags; able to design and promote a unified change governance system.
- Observability: Hands-on experience across metrics, logs, tracing, and profiling; familiarity with eBPF-based approaches to improve observability and troubleshooting efficiency.
- Cloud-native & Networking Fundamentals: Understanding of Kubernetes, CNI, traffic management, and global LB/Anycast; practical exposure to self-built CDN/edge node runtimes.

Preferred Qualifications
- Proven track record of scaling and operating a global follow-the-sun on-call and incident command process across regions/time zones.
- Depth in eBPF-based observability and diagnosis toolchains, and/or edge traffic infrastructure (global load balancing/Anycast, CDN/edge runtime) operations.
- Demonstrated ability to communicate effectively with global teams across multiple time zones; experience with complex cross-border technical collaborations.