We are seeking an Azure Infrastructure Engineer to design, implement, and maintain the core cloud architecture for our unified data platform on Microsoft Azure. The ideal candidate will have extensive hands-on experience in Infrastructure as Code (IaC), enterprise networking, and cloud security to provide a stable, secure, and automated foundation for our enterprise-wide data and analytics initiatives. The engineer will support data ingestion, analytics, AI/ML workloads, infrastructure automation, monitoring, cost optimization, and L2/L3 support across production environments.
Key Responsibilities:
- Architect and Provision Data Infrastructure: Design and deploy the underlying Azure infrastructure for data services including Azure Synapse Analytics, Azure Data Factory, Azure Data Lake Storage (ADLS Gen2), and Databricks workspaces using a modular, reusable approach.
- Automated Environment Management: Build and maintain Terraform configurations to automate the lifecycle of the data platform across Development, Staging, and Production environments. Manage Terraform state files securely using Azure Storage with state locking.
- Network & Connectivity Design: Implement enterprise networking solutions including hub-and-spoke topology, Virtual WAN, and Azure Firewall. Ensure all data assets are isolated using Private Endpoints, VNets, and Network Security Groups (NSGs).
- Platform Security & Governance: Manage Microsoft Entra ID (formerly Azure Active Directory), implement RBAC, and enforce data security policies. Configure encryption at rest and in transit, and manage secrets using Azure Key Vault.
- AI & ML Infrastructure Support: Provision and harden Azure Machine Learning workspaces and compute clusters. Implement the infrastructure requirements for Azure OpenAI service integrations, focusing on private connectivity and quota management.
- DevSecOps Integration: Collaborate with data teams to support CI/CD pipelines using Azure DevOps. Implement pipelines/workflows for infrastructure changes, including pull request reviews and automated testing gates.
- Observability & Monitoring: Monitor the health of the infrastructure using Azure Monitor and Log Analytics. Set up alerts for platform availability, connectivity issues, and resource performance.
- L2/L3 Infrastructure Escalation: Troubleshoot and resolve complex escalations related to networking bottlenecks, service-level failures, and IAM permission conflicts within the data platform.
- Cost Management & FinOps: Optimize cloud costs by analysing resource utilization (e.g., Databricks clusters, Synapse pools) and implementing autoscaling, right-sizing, and shutdown schedules.
- Business Continuity: Design and implement backup, disaster recovery, and high-availability strategies for critical infrastructure components and storage accounts.
- Documentation: Maintain comprehensive documentation for platform architecture, infrastructure diagrams, runbooks, and disaster recovery procedures.
Required Qualifications:
- 4-8+ years of experience working with Microsoft Azure cloud services, with a specific focus on platform engineering and infrastructure automation, preferably on Singapore Government projects hosted on Government on Commercial Cloud (GCC).
- Expert-level Infrastructure as Code (IaC): Deep hands-on experience with Terraform, ARM templates, or Bicep for managing complex cloud environments.
- Advanced Networking Knowledge: Proficiency in designing secure environments using Azure Private Link, hub-and-spoke topologies, and Azure Firewall.
- Platform Experience: Strong experience in provisioning Azure Synapse Analytics, Azure Data Factory, Azure Data Lake Storage, Azure AI Foundry, and Azure Machine Learning workspaces.
- Automation & Scripting: Proficiency in PowerShell, Azure CLI, or Python for infrastructure automation and management tasks.
- DevOps Tooling: Experience building and maintaining CI/CD pipelines in Azure DevOps.
- Certifications: A Microsoft Azure certification such as Azure Solutions Architect Expert (AZ-305) or Azure Administrator Associate (AZ-104) is required. The HashiCorp Terraform Associate certification is highly preferred.
- Troubleshooting: Strong skills in performance tuning and root cause analysis for infrastructure-level issues.
Preferred Skills:
- Security & Compliance: Knowledge of security best practices (CIS benchmarks) and compliance frameworks (IM8) relevant to data platforms and infrastructure in Government on Commercial Cloud (GCC).
- Containerization: Experience with Docker and Azure Kubernetes Service (AKS) as they relate to hosting ML or data workloads.
- Modern Data Architecture: Familiarity with the infrastructure requirements of Medallion and Lakehouse architectures.
- AI/LLM Infrastructure: Understanding of the resource and networking requirements for scaling Azure OpenAI and generative AI applications.