This is a technology-driven trading firm operating in global financial markets, built on robust, low-latency systems, disciplined engineering, and a culture that values ownership, precision, and continuous improvement. Technology is at the core of everything the firm does.
The Role
As a Site Reliability Engineer, you will be responsible for the reliability, performance, and scalability of mission-critical trading systems. You will work closely with software engineers, traders, and infrastructure teams to ensure the platforms operate with high availability and predictable performance in a fast-paced, real-time environment.
This role is hands-on and suited to engineers who enjoy deep technical challenges, production ownership, and building resilient systems at scale.
Responsibilities
- Design, build, and operate highly reliable and scalable production systems
- Monitor, troubleshoot, and resolve production issues across trading and market data platforms
- Improve system observability through metrics, logging, and alerting
- Automate operational workflows and reduce manual toil
- Partner with engineering teams to influence system design for reliability and operability
- Participate in on-call rotations and incident response, including post-incident reviews
- Continuously improve deployment, capacity planning, and disaster recovery practices
Requirements
- Strong experience in Site Reliability Engineering, Systems Engineering, or Production Engineering
- Solid understanding of Linux systems, networking, and distributed systems
- Proficiency in at least one programming or scripting language (e.g. Python, Go, Java, C++, Bash)
- Experience with monitoring and observability tools (e.g. Prometheus, Grafana, ELK, Datadog)
- Familiarity with containerisation and orchestration technologies (e.g. Docker, Kubernetes)
- Experience running production systems in high-availability, low-latency, or high-throughput environments
- Ability to debug complex issues under pressure and communicate clearly during incidents
Nice to Have
- Experience in trading, financial services, or other latency-sensitive environments
- Knowledge of cloud platforms (AWS, GCP, Azure) and hybrid infrastructure
- Experience with infrastructure-as-code tools (e.g. Terraform, Ansible)
- Understanding of TCP/IP performance tuning and kernel-level optimisations