Manage and maintain infrastructure deployed on Alibaba Cloud and AWS.
Deploy, configure, scale, and upgrade services such as Kubernetes (K8s), RDS, Redis, Kafka, and MongoDB.
Design, configure, and optimize cloud network architectures (e.g., VPCs, subnets, security groups) to ensure secure cross-region and cross-VPC communication.
Configure and maintain VPN and remote access systems to ensure secure, encrypted, and highly available data transfer.
Provide infrastructure technical support to business teams (e.g., development and product teams) and resolve operational issues in a timely manner.
Write and maintain automation scripts (e.g., Bash, Python) to improve operational efficiency and reduce manual intervention.
Design and implement an automated operations platform that supports self-healing, elastic scaling, log aggregation, monitoring and alerting, and configuration management.
Automate resource management and ensure environment consistency using Infrastructure as Code (IaC) tools such as Terraform and Ansible.
Design and maintain GitOps-based CI/CD pipelines to improve release efficiency and system delivery quality.
Build and optimize a comprehensive observability system covering logs, metrics, and distributed tracing, using tools such as Prometheus, Grafana, ELK/EFK, Jaeger, and OpenTelemetry.
Track and evaluate cloud-native technology trends (e.g., Serverless architecture, Service Mesh) to drive infrastructure modernization.
Continuously monitor systems, plan capacity, and tune performance to ensure high availability, security, and scalability in the production environment.
Requirements
More than 10 years of experience in infrastructure/operations, with a solid understanding of computer fundamentals and system architecture.
Deep understanding of mainstream cloud platform architectures and services (e.g., Alibaba Cloud, AWS, GCP), with practical deployment and operations experience.
Familiar with cloud-native architecture concepts, with experience supporting microservice systems and knowledge of service registration, discovery, governance, and configuration centers.
Familiar with Serverless technology (e.g., AWS Lambda, Alibaba Cloud Function Compute), including where it applies and how it is operated.
Strong command of how observability systems are built, with the ability to independently build and maintain components for monitoring, logging, alerting, and distributed tracing.
Proficient in Linux system administration and tuning, and skilled in using Bash, Python, and other languages for operations automation.
Proficient in container orchestration and runtime platforms such as Kubernetes and Docker, with an understanding of their underlying principles and debugging methods.
Familiar with the design and maintenance of CI/CD toolchains (e.g., Jenkins, GitLab CI/CD, ArgoCD).
Proficient in network fundamentals (e.g., TCP/IP, VPC, VPN, firewalls), with a CCNA certification or equivalent expertise.
Familiar with the deployment, operation, and monitoring of common databases and middleware (e.g., MySQL, Redis, Kafka, MongoDB).
A DevOps mindset, with the ability to continuously improve system delivery and stability through automation and standardization.
Strong communication and collaboration skills, a clear sense of responsibility, and solid problem-solving ability.