Job Summary
We are looking for a highly skilled Senior Splunk Engineer to design, implement, and manage enterprise-scale SIEM and observability solutions. This role will focus on enhancing system visibility, ensuring platform reliability, and supporting security and compliance requirements within a regulated environment. The ideal candidate will have strong expertise in Splunk, cloud platforms, and SRE practices, along with the ability to troubleshoot complex issues and drive continuous improvements.
Responsibilities
- Design, implement, and maintain Splunk-based SIEM and observability platforms.
- Develop and optimize log ingestion, parsing, correlation searches, dashboards, and alerts.
- Integrate Splunk with cloud platforms (AWS, Azure) and enterprise tools such as ServiceNow and Datadog.
- Define and implement monitoring strategies, including SLIs/SLOs, service health models, and alerting frameworks.
- Perform incident investigation, troubleshooting, and root cause analysis (RCA) for system and application issues.
- Build and implement automation and auto-remediation solutions using Terraform, Ansible, and Python.
- Support CI/CD pipelines for Splunk configurations and infrastructure deployments.
- Ensure adherence to security, compliance, and regulatory standards, particularly within financial services environments.
- Collaborate with cross-functional teams (Infrastructure, Security, DevOps, and Application teams) to improve observability and reliability.
- Drive continuous improvement initiatives and adopt SRE best practices.
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related discipline.
- Minimum 8 years of experience in Infrastructure, Cloud, or SRE roles, with at least 5 years specializing in Splunk/SIEM engineering or observability.
- Strong hands-on expertise in:
  - SIEM Platforms: Splunk (mandatory), Elastic (ELK Stack)
  - Automation & IaC: Terraform, Ansible, Python, CI/CD tools
  - Cloud Platforms & Integrations:
    - AWS (CloudWatch, X-Ray, CloudTrail)
    - Azure (Monitor, Log Analytics, Application Insights)
    - Datadog, ServiceNow
- Deep understanding of SRE principles, including service health modeling, SLIs/SLOs, error budgets, and auto-remediation.
- Strong analytical and troubleshooting skills, with experience in deep-dive investigations and in designing long-term solutions.
- Familiarity with financial sector operational resilience, regulatory compliance, and incident governance frameworks.
- Excellent written and verbal communication skills.
- Strong interpersonal skills with the ability to engage and collaborate with diverse stakeholders.
- Agile mindset with the ability to learn quickly and adapt to changing environments.