Job Description:
Assume a vital position as a key member of a high-performing team that delivers infrastructure and performance excellence. Your role will be instrumental in shaping the future at one of the world's largest and most influential companies.
As a Lead Infrastructure Engineer at JPMorganChase within the Infrastructure Platform, you apply deep knowledge of software, applications, and technical processes within the infrastructure engineering discipline. Continue to evolve your technical and cross-functional knowledge outside of your aligned domain of expertise.
Job responsibilities
- Lead production operations for critical services: act as incident commander for Priority 1/2 events, drive rapid restoration, clear communications, and post-incident reviews with owned, time-bound remediations.
- Own stability and resiliency improvements: implement and standardize patterns (timeouts/retries, circuit breakers, bulkheads, back-pressure, graceful degradation) and run failover/chaos exercises to validate recovery.
- Drive cross-platform architecture and modernization: partner with application, platform, and security teams to design and implement changes that reduce operational risk and improve reliability and performance.
- Deliver hands-on design, development, and troubleshooting for complex infrastructure issues; create durable fixes and automation that prevent recurrence and reduce manual toil.
- Manage workstreams end-to-end across one or more infrastructure domains (e.g., Kubernetes, Linux, networking, databases, cloud), ensuring clear scope, milestones, and measurable outcomes.
- Apply strong systems thinking: assess upstream/downstream dependencies and data flows; identify technical implications and advise on mitigation, rollout sequencing, and safe change strategies.
- Operate effectively in a 24/7 model: support on-call rotations, improve runbooks and diagnostics, and continually raise the bar on detection, alert quality, and response time.
Required qualifications, capabilities, and skills
- Bachelor's degree in Computer Science, Cybersecurity, Data Science, or a related discipline.
- 5+ years of relevant infrastructure engineering experience, with increasing scope/ownership.
- Deep expertise in one or more core areas: compute and OS (Linux), networking, databases/storage, container orchestration, CI/CD and deployment practices, integration/automation, scaling, resiliency, and performance engineering.
- Strong observability and monitoring proficiency, including metrics, logs, distributed tracing, alerting, and SLO/SLA design.
- Demonstrated troubleshooting across heterogeneous platforms and services, with hands-on administration in Linux, middleware, and databases.
- Practical experience operating modern infrastructure stacks: Linux, Kubernetes, AWS, Terraform, and observability tooling such as Splunk, Grafana, Datadog, and AWS X-Ray.
- Database exposure with one or more of: Cassandra, Oracle, CockroachDB; ability to assess performance, capacity, and resilience trade-offs.
- Proficiency in scripting and software engineering for infrastructure (e.g., Bash, Python); ability to build automation, tooling, and integrations.
- Deep knowledge of cloud infrastructure and services across public and private clouds, including migration patterns and hybrid connectivity.
- Experience identifying and resolving production issues on public cloud platforms; ability to lead service improvement plans and problem management.
- Proven experience with LLM orchestration frameworks or custom agent runtimes; strong API design, reliability engineering, and end-to-end observability (tracing/metrics/logging). Delivered at least one agentic system to production with quantified impact (e.g., automation rate, latency, cost).
Preferred qualifications, capabilities, and skills
- Incident leadership: serves as incident commander for Sev1/Sev2 events, drives clear comms and rapid restoration, and ensures post-incident reviews with owned, time-bound remediations.
- SRE practices at scale: defines/enforces SLOs/SLIs and error budgets; improves on-call quality with actionable runbooks, sustainable alerting, and clear escalation paths.
- Observability and automation: advances metrics/logs/traces and synthetic probes; builds self-healing automation for diagnostics and common remediations.