Leadership and Management
- Lead infrastructure engineering teams to deliver comprehensive managed services for entire IT infrastructure environments.
- Direct desktop engineering teams to provide first-level support and technical problem resolution for end-user communities.
Strategic Operations
- Oversee and direct daily IT infrastructure operations, ensuring reliable and secure system, service, and application performance.
- Monitor and manage incident response for business-critical systems, with a focus on timely resolution to prevent operational delays and service outages.
Organisational Engagement
- Engage effectively with organisational management to establish guidelines, policies, and procedures, and oversee their execution.
- Manage multiple concurrent deadlines as a self-directed professional with appropriate prioritisation skills.
Operational Excellence
- Monitor and respond to data centre issues and incidents whilst performing routine operational checks on servers, network devices, storage, and environmental systems.
- Track IT asset inventory, ensuring full equipment accountability and end-of-life management.
Incident and Change Management
- Respond promptly to system alerts, alarms, and incidents with appropriate escalation to support teams following defined procedures.
- Support incident troubleshooting and recovery activities whilst managing planned maintenance, change requests, and scheduled outages.
- Coordinate hardware installation, replacement, and decommissioning activities alongside media handling and secure storage management.
Infrastructure Management
- Design, build, and maintain critical cloud infrastructure platforms encompassing compute, storage, networking, containerisation, virtualisation, DNS, monitoring, and supporting systems across development, staging, and production environments.
- Monitor and manage comprehensive cloud services including CloudWatch logs, alarms, synthetic monitoring, and integrated third-party solutions.
Monitoring and Observability
- Implement and maintain robust monitoring and observability frameworks for all platform components using modern tooling including Amazon CloudWatch Synthetics canaries, StackOps, Prometheus, Grafana, and the ELK stack.
- Establish comprehensive observability practices to support proactive problem diagnosis and provide actionable insights into system health and performance metrics.
Compliance and Security
- Maintain adherence to Whole-of-Government platform standards, compliance frameworks, and security requirements through continuous monitoring using government-approved security and monitoring solutions.
- Implement security controls including access management, security hardening, and compliance monitoring with tools such as CyberArk.
Automation and Infrastructure as Code
- Develop and maintain infrastructure using Infrastructure as Code (IaC) methodologies with tools including Terraform, Ansible, and AWS CloudFormation to ensure repeatable, automated, and version-controlled deployments.
- Follow platform standards whilst executing infrastructure automation and modern operational practices to enhance efficiency and reliability.
Site Reliability Engineering
- Identify and eliminate repetitive operational tasks (toil) to improve developer and infrastructure engineer efficiency, and enhance overall system reliability through error budget management.
- Define, track, and report on SRE metrics including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
Platform Operations
- Manage virtualisation platforms including VMware vSphere and Hyper-V, encompassing capacity monitoring, performance optimisation, and lifecycle management.
- Administer AWS services including EC2, ECS, S3, RDS (PostgreSQL and Microsoft SQL Server), Lambda, CloudFormation, CloudWatch, IAM, and VPC configurations, together with Docker/Kubernetes container platforms and physical server infrastructure.
Network and System Administration
- Demonstrate proficiency with local networking technologies including TCP/IP, DNS, DHCP, VPN configurations, and routing protocols.
- Execute comprehensive platform patching strategies leveraging automation to maintain security and stability whilst minimising service disruption.
Business Continuity
- Maintain backup, disaster recovery, and high availability solutions for critical platform components, including resilience testing with AWS Fault Injection Simulator (FIS) and multi-Availability Zone configurations.
- Support containerisation initiatives and maintain container orchestration platforms for traditional workloads.
Collaboration and Documentation
- Collaborate effectively with application teams to support platform stability, performance, and scalability requirements.
- Create and maintain comprehensive platform documentation, operational runbooks, and standard operating procedures.
- Support team development through knowledge sharing and mentoring on platform operations and modern infrastructure practices.
Technical Skills
- Advanced experience with enterprise virtualisation platforms (VMware vSphere, Hyper-V)
- Proficiency in Linux and Windows Server administration
- Expertise in server monitoring tool installation and regular patching of virtual and physical servers
- Comprehensive health check capabilities for servers, storage, and virtualisation platforms
- Strong experience with infrastructure automation tools (Ansible, Puppet, Chef)
- Proficiency with container technologies (ECS, Docker, Kubernetes)
- Experience with monitoring and observability platforms
- Infrastructure as Code expertise (Terraform, AWS CloudFormation, Ansible)
- Solid understanding of networking concepts and technologies
- Scripting capabilities in Python, PowerShell, Bash, and Node.js
- Experience with high-availability and disaster recovery solutions including AWS FIS
- Proficiency with GitHub tools and CI/CD pipeline setup and workflow management
Professional Qualifications
- Bachelor's degree in Computer Science, Information Technology, or a related technical discipline, with demonstrated experience in infrastructure operations and engineering.
- Strong understanding of enterprise infrastructure components with proven experience supporting infrastructure modernisation initiatives.
- Excellent analytical and problem-solving capabilities with strong documentation skills and effective communication abilities for both technical and non-technical stakeholders.
Desired Certifications
- VMware Certified Professional (VCP) or equivalent VMware vSphere certification
- Microsoft Certified: Windows Server
- Red Hat Certified Engineer (RHCE)
- AWS Certified Solutions Architect or AWS Certified SysOps Administrator
- Additional certifications in networking, security, or government IT standards
- Previous experience in government or highly regulated environments is strongly preferred