Reliability Director

DayOne

Singapore

10-12 Years

Job Description

Join DayOne – Shaping the Future of Data Infrastructure

DayOne is a global leader in the development and operation of high-performance data centres. As one of the fastest-growing companies in the industry, we've built a robust presence across Asia and Europe, and we're just getting started.

As we expand into new international markets, we're looking for talented, driven individuals to join us on this exciting journey. This is more than a job — it's an opportunity to be a key contributor to our dynamic team and help shape the future of global data infrastructure.

If you're passionate about innovation, technology, and growth, we invite you to be part of DayOne's next chapter.

Position Overview

The Reliability Director is responsible for governing infrastructure reliability and systemic technical risk across the global data centre portfolio. The role leads equipment performance monitoring, failure analysis, and reliability improvement initiatives across electrical, mechanical, and control systems.

The Reliability Director establishes portfolio-wide reliability frameworks, monitors critical equipment performance, investigates failure mechanisms, and ensures reliability risks are proactively identified and mitigated. The role works closely with GERA engineering authorities, DCO teams, design teams, and vendors to ensure mission‑critical infrastructure reliability is maintained across all campuses.

Key Responsibilities

Portfolio Reliability Governance

Establish and maintain the global reliability governance framework across the data centre portfolio.
Maintain and manage the Global Systemic Risk Register.
Identify systemic infrastructure risks across campuses and define mitigation strategies.
Ensure reliability practices are consistently applied across all sites.

Equipment Performance Monitoring

Establish monitoring frameworks for critical infrastructure equipment performance.
Analyse operating data from electrical and mechanical systems to identify degradation trends.
Monitor redundancy utilisation, abnormal operating conditions, and reliability indicators.
Identify early warning signals for potential equipment failures.

Failure Analysis & Root Cause Investigation

Lead structured Root Cause Analysis (RCA) for major infrastructure incidents.
Perform failure mode analysis using fault tree and event chain methodologies.
Identify recurring failure mechanisms across sites.
Ensure lessons learned from failures are captured and shared across the organisation.

Reliability Risk Assessment

Assess reliability risks associated with infrastructure design, operations, and vendor equipment.
Evaluate cross-campus failure exposure and correlated infrastructure vulnerabilities.
Provide technical recommendations to mitigate systemic reliability risks.

Reliability Data Analytics & Reporting

Develop reliability performance dashboards and trend analysis.
Monitor incident frequency, failure severity, and infrastructure performance trends.
Provide reliability reports and insights to engineering and executive leadership.

Reliability Improvement Programs

Lead initiatives to improve infrastructure resilience and reliability.
Identify reliability improvement opportunities across systems and sites.
Validate the effectiveness of remediation and reliability improvement actions.

Lifecycle & Obsolescence Management

Define strategies for managing aging infrastructure and equipment lifecycle risks.
Assess replacement versus life‑extension strategies for critical infrastructure systems.
Support long-term infrastructure planning from a reliability perspective.

Vendor & OEM Reliability Oversight

Evaluate vendor reliability performance and technical claims.
Assess OEM failure rates and design-related reliability exposure.
Collaborate with vendors to address systemic equipment reliability issues.

Cross-Team Collaboration

Work closely with DCO teams, Engineering Authorities, Design Teams, and external vendors.
Support reliability engineering input into ROCC and MCR operational frameworks.
Promote knowledge sharing and reliability best practices across the organisation.

Candidate Requirements

Bachelor's degree in Engineering (Electrical, Mechanical, or related discipline).
10+ years' experience in mission‑critical infrastructure.
Strong experience in equipment reliability analysis, failure investigation, and RCA.
Deep understanding of electrical and mechanical systems in data centre environments.
Proven ability to identify systemic reliability risks and implement mitigation strategies.
Strong analytical and problem‑solving capabilities.
Experience working across multiple sites or regional infrastructure portfolios preferred.
Excellent communication and stakeholder management skills.
Willingness to travel across regional sites when required.

DayOne is proud to be an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

If you're ready to grow with one of the fastest-moving companies in the data centre industry, apply now and be part of our global journey.