
Assume a vital position as a key member of a high-performing team that delivers infrastructure and performance excellence. Your role will be instrumental in shaping the future at one of the world's largest and most influential companies.
As a Lead Infrastructure Engineer at JPMorganChase within the Infrastructure Platform, you apply deep knowledge of software, applications, and technical processes within the infrastructure engineering discipline. Continue to evolve your technical and cross-functional knowledge outside of your aligned domain of expertise.
Job responsibilities
. Lead production operations for critical services: act as incident commander for Priority 1/2 events, drive rapid restoration, clear communications, and post-incident reviews with owned, time-bound remediations.
. Own stability and resiliency improvements: implement and standardize patterns (timeouts/retries, circuit breakers, bulkheads, back-pressure, graceful degradation) and run failover/chaos exercises to validate recovery.
. Drive cross-platform architecture and modernization: partner with application, platform, and security teams to design and implement changes that reduce operational risk and improve reliability and performance.
. Deliver hands-on design, development, and troubleshooting for complex infrastructure issues; create durable fixes and automation that prevent recurrence and reduce manual toil.
. Manage workstreams end-to-end across one or more infrastructure domains (e.g., Kubernetes, Linux, networking, databases, cloud), ensuring clear scope, milestones, and measurable outcomes.
. Apply strong systems thinking: assess upstream/downstream dependencies and data flows; identify technical implications and advise on mitigation, rollout sequencing, and safe change strategies.
. Operate effectively in a 24/7 model: support on-call rotations, improve runbooks and diagnostics, and continually raise the bar on detection, alert quality, and response time.
Required qualifications, capabilities, and skills
. Bachelor's degree in Computer Science, Cybersecurity, Data Science, or a related discipline.
. 5+ years of relevant infrastructure engineering experience, with increasing scope/ownership.
. Deep expertise in one or more core areas: compute and OS (Linux), networking, databases/storage, container orchestration, CI/CD and deployment practices, integration/automation, scaling, resiliency, and performance engineering.
. Strong observability and monitoring proficiency, including metrics, logs, distributed tracing, alerting, and SLO/SLA design.
. Demonstrated troubleshooting across heterogeneous platforms and services, with hands-on administration in Linux, middleware, and databases.
. Practical experience operating modern infrastructure stacks: Linux, Kubernetes, AWS, Terraform, and observability tooling such as Splunk, Grafana, Datadog, and AWS X-Ray.
. Database exposure with one or more of Cassandra, Oracle, or CockroachDB; ability to assess performance, capacity, and resilience trade-offs.
. Proficiency in scripting and software engineering for infrastructure (e.g., Bash, Python); ability to build automation, tooling, and integrations.
. Deep knowledge of cloud infrastructure and services across public and private clouds, including migration patterns and hybrid connectivity.
. Experience identifying and resolving production issues on public cloud platforms; ability to lead service improvement plans and problem management.
. Proven experience with LLM orchestration frameworks or custom agent runtimes; strong API design, reliability engineering, and end-to-end observability (tracing/metrics/logging). Has delivered at least one agentic system to production with quantified impact (e.g., automation rate, latency, cost).
Preferred qualifications, capabilities, and skills
. Incident leadership: serves as incident commander for Sev1/Sev2 events, drives clear comms and rapid restoration, and ensures post-incident reviews with owned, time-bound remediations.
. SRE practices at scale: defines/enforces SLOs/SLIs and error budgets; improves on-call quality with actionable runbooks, sustainable alerting, and clear escalation paths.
. Observability and automation: advances metrics/logs/traces and synthetic probes; builds self-heal automation for diagnostics and common remediations.
To apply for this position, please use the following URL:
https://ars2.equest.com/response_id=247a53218b98f18244918bd112792f38
Job ID: 144213539