Job Description:
- SLOs & error budgets - Define, track, and evangelize latency and availability targets for our payment APIs.
- Observability - Deploy Cloud Monitoring, Cloud Trace, Error Reporting, and dashboards; integrate alerts via Incident.io and Slack for on-call.
- Incident lifecycle - Establish blameless postmortems, guardrails, and runbooks to drive learning and prevent recurrence.
- CI/CD golden path - Codify Cloud Build pipelines and automated canary rollouts for Cloud Functions / Cloud Run.
- Infrastructure as Code - Manage GCP resources; embed security, IAM least-privilege, and cost controls by default.
- Performance & cost tuning - Profile hot paths (BigQuery, Firestore, Pub/Sub) and implement caching or concurrency improvements to keep user latency under 100 ms.
- Developer tooling - Eliminate toil by improving local-to-prod parity and secrets management, and by making environments spin up with a single command.
- Culture carrier - Instill reliability thinking across engineering and product as the first platform-focused hire.
Requirements:
- 5+ years of experience building and operating production systems at scale, ideally on Google Cloud or a similar serverless stack and in fast-paced or startup settings.
- Hands-on fluency with Firebase, Cloud Build, Cloud Run/Functions, Pub/Sub, Cloud SQL/Spanner, and VPC Service Controls.
- Strong coding in Python or Go for automation, with an eye on maintainability.
- Demonstrated track record of driving observability, on-call, and cost optimisation in a fast-moving environment.
- Excellent collaboration and communication skills to work effectively with cross-functional teams.
- Experience in payments, PCI DSS, or crypto settlement flows is a bonus.
Tech note: We are 99% serverless. There are no pet VMs to patch, but the stakes are higher: every cold start, DB connection pool, and retry policy can impact real money transfers. You'll architect for resiliency and velocity.