

KEY RESPONSIBILITIES
Infrastructure Management: Design, build, and maintain critical cloud infrastructure platforms encompassing compute, storage, networking, containerisation, virtualisation, DNS, monitoring, and supporting systems across development, staging, and production environments. Monitor and manage comprehensive cloud services including CloudWatch logs, alarms, synthetic monitoring, and integrated third-party solutions.
Monitoring and Observability: Implement and maintain robust monitoring and observability frameworks for all platform components utilising modern tooling including AWS CloudWatch Canaries, StackOps, Prometheus, Grafana, and ELK stack implementations. Establish comprehensive observability practices to support proactive problem diagnosis and provide actionable insights into system health and performance metrics.
Compliance and Security: Maintain adherence to Whole-of-Government platform standards, compliance frameworks, and security requirements through continuous monitoring using government-approved security and monitoring solutions. Implement security controls including access management, security hardening, and compliance monitoring with tools such as CyberArk.
Automation and Infrastructure as Code: Develop and maintain infrastructure using Infrastructure as Code (IaC) methodologies with tools including Terraform, Ansible, and AWS CloudFormation to ensure repeatable, automated, and version-controlled deployments. Follow platform standards whilst executing infrastructure automation and modern operational practices to enhance efficiency and reliability.
Site Reliability Engineering: Identify and eliminate repetitive operational tasks to improve Developer and Infrastructure Engineer efficiency whilst enhancing overall system reliability through systematic toil elimination and error budget management. Define, track, and report on SRE metrics including Service Level Objectives (SLO), Service Level Indicators (SLI), and error budgets.
Platform Operations: Manage virtualisation platforms including VMware vSphere and Hyper-V, encompassing capacity monitoring, performance optimisation, and lifecycle management. Administer AWS Cloud services including EC2, ECS, S3, RDS (PostgreSQL and MS SQL), Lambda, CloudFormation, CloudWatch, IAM, and VPC configurations, together with Docker/Kubernetes container platforms and physical server infrastructure.
Network and System Administration: Demonstrate proficiency with local networking technologies including TCP/IP, DNS, DHCP, VPN configurations, and routing protocols. Execute comprehensive platform patching strategies leveraging automation to maintain security and stability whilst minimising service disruption.
Business Continuity: Maintain backup, disaster recovery, and high availability solutions for critical platform components including AWS Fault Injection Simulator (FIS) testing and multi-availability zone configurations. Support containerisation initiatives and maintain container orchestration platforms for traditional workloads.
Collaboration and Documentation: Collaborate effectively with application teams to support platform stability, performance, and scalability requirements. Create and maintain comprehensive platform documentation, operational runbooks, and standard operating procedures. Support team development through knowledge sharing and mentoring on platform operations and modern infrastructure practices.
EXPERIENCE AND SKILLS REQUIRED
Technical Expertise:
Advanced experience with enterprise virtualisation platforms (VMware vSphere, Hyper-V)
Proficiency in Linux and Windows Server administration
Expertise in server monitoring tool installation and regular patching of virtual and physical servers
Comprehensive health check capabilities for servers, storage, and virtualisation platforms
Strong experience with infrastructure automation tools (Ansible, Puppet, Chef)
Proficiency with container technologies (ECS, Docker, Kubernetes)
Experience with monitoring and observability platforms
Infrastructure as Code expertise (Terraform, AWS CloudFormation, Ansible)
Solid understanding of networking concepts and technologies
Scripting capabilities in Python, PowerShell, Bash, and Node.js
Experience with high-availability and disaster recovery solutions including AWS FIS
Proficiency with GitHub tools and CI/CD pipeline setup and workflow management
Professional Qualifications: Bachelor's degree in Computer Science, Information Technology, or a related technical discipline, with demonstrated experience in infrastructure operations and engineering. Strong understanding of enterprise infrastructure components, with proven experience supporting infrastructure modernisation initiatives.
Core Competencies: Excellent analytical and problem-solving capabilities with strong documentation skills and effective communication abilities for both technical and non-technical stakeholders.
Desired Certifications:
VMware Certified Professional (VCP) or equivalent VMware vSphere certification
Microsoft Certified: Windows Server
Red Hat Certified Engineer (RHCE)
AWS Certified Solutions Architect or AWS Certified SysOps Administrator
Additional certifications in networking, security, or government IT standards
Previous experience in government or highly regulated environments is strongly preferred.
HTC Global Services
Established in 1990, HTC Global Services is an Inc. 500 Hall of Fame company and one of the fastest-growing Asian American companies in the US, with headquarters in Troy, Michigan. A global provider of IT Solutions and Business Process Outsourcing services, HTC’s client base spans several Global 2000 organizations. HTC is committed to providing solutions that translate into tangible business outcomes for our customers. HTC manages the IT environments, IT applications, and business processes of its customers, focusing on providing transformational benefits.
Mission:
We are a global IT solutions provider adding value to our clients and people through emerging technologies. We are dedicated to the success of our clients, employees, business partners, suppliers, community, and stakeholders.
Job ID: 142944577