[LPS] Sr Operation Mgmt Specialist

Fresher
  • Posted 7 hours ago

Job Description

Description and Requirements

Job Summary

The Site Reliability Engineer (SRE) is responsible for ensuring the reliability, availability, and performance of enterprise systems in a managed services (Day 2) environment.

The role operates as a centralised reliability function across application, infrastructure, and vendor support layers, governing operational activities to ensure that incidents, changes, and patching activities are executed without impacting service stability or SLA commitments.

The SRE works closely with Application Engineers (L1.5) and Application Vendors (L2), providing oversight, risk control, and engineering-driven improvements to maintain a stable and resilient production environment.

Key Responsibilities

1. Reliability & Service Assurance

  • Own end-to-end service reliability, including availability, performance, and system stability
  • Define and track reliability metrics (e.g., uptime, latency, error rates)
  • Ensure SLA compliance through proactive monitoring and operational governance
  • Establish service health indicators and early warning mechanisms

2. Monitoring & Observability

  • Design and implement monitoring, logging, and alerting frameworks across application and infrastructure layers
  • Define alert thresholds and reduce alert noise to improve signal quality
  • Develop dashboards and reporting for real-time visibility of system health
  • Continuously enhance observability coverage across services

3. Incident Management & RCA

  • Lead major incident management (P1/P2) as incident commander
  • Perform end-to-end root cause analysis (RCA) across application, infrastructure, and vendor domains
  • Coordinate with Application Engineers and Vendors for issue resolution
  • Drive preventive and corrective actions to reduce incident recurrence

4. Change & Patch Governance

  • Assess operational risks associated with changes, releases, and patching activities
  • Work with Application Engineers (L1.5) and Vendors (L2) to ensure safe execution of application patches
  • Perform pre- and post-change validation to ensure system stability
  • Govern Go/No-Go decisions and support rollback planning in case of service degradation

5. Performance & Capacity Management

  • Monitor and optimise system performance across application and infrastructure layers
  • Conduct capacity planning and forecasting to ensure scalability and resilience
  • Identify and address performance bottlenecks proactively

6. Automation & Continuous Improvement

  • Drive automation of operational processes, including monitoring, recovery, and validation
  • Implement self-healing and resilience mechanisms where applicable
  • Develop and maintain operational runbooks and automation scripts
  • Continuously improve system reliability through engineering practices

7. Collaboration & Governance

  • Work closely with Application Engineers (L1.5) for execution of operational activities
  • Collaborate with Vendors (L2) for defect resolution and product-level fixes
  • Ensure compliance with governance, security, and audit requirements
  • Support service reviews, reporting, and continuous improvement initiatives

Requirements

Core Requirements

  • Experience in Site Reliability Engineering, DevOps, or production operations in enterprise environments
  • Strong understanding of cloud platforms (preferably AWS)
  • Experience with monitoring and observability tools
  • Strong troubleshooting capability across application and infrastructure layers
  • Experience in incident management and root cause analysis
  • Familiarity with ITIL processes (incident, problem, change management)

Preferred

  • Experience in a system integrator or managed services (Day 2 operations) environment
  • Exposure to enterprise applications (e.g., IWMS platforms such as Archibus or similar)
  • Experience with automation and scripting (Python, PowerShell, etc.)
  • Knowledge of performance tuning and capacity planning

Key Competencies

  • Strong analytical and problem-solving skills
  • Ability to lead during high-pressure incidents
  • Structured and governance-driven mindset
  • Proactive approach to reliability and continuous improvement
  • Strong stakeholder coordination across internal teams and vendors

Success Measures

  • SLA compliance (availability, uptime, performance)
  • Reduction in MTTR and incident recurrence
  • Stability during patching and change activities
  • Increased automation and reduced manual intervention
  • Improved system performance and resilience

About Company

Why Work at Lenovo

We are Lenovo. We do what we say. We own what we do. We WOW our customers. Lenovo is a US$69 billion revenue global technology powerhouse, ranked #196 in the Fortune Global 500, and serving millions of customers every day in 180 markets. Focused on a bold vision to deliver Smarter Technology for All, Lenovo has built on its success as the world's largest PC company with a full-stack portfolio of AI-enabled, AI-ready, and AI-optimized devices (PCs, workstations, smartphones, tablets), infrastructure (server, storage, edge, high-performance computing, and software-defined infrastructure), software, solutions, and services. Lenovo's continued investment in world-changing innovation is building a more equitable, trustworthy, and smarter future for everyone, everywhere. Lenovo is listed on the Hong Kong stock exchange under Lenovo Group Limited (HKSE: 992) (ADR: LNVGY). To find out more, visit www.lenovo.com and read about the latest news via our StoryHub.

Job ID: 145622061
