
Job Description
Role Overview
We are seeking a highly skilled Site Reliability Engineer (SRE) to lead the reliability, scalability, and performance of our operations. You will be the primary owner of the AWS cloud infrastructure and the end-to-end DevOps pipelines. Your mission is to treat operations as a software problem, automating away manual toil and ensuring our AWS environment delivers a seamless experience for both agents and customers.
Key Responsibilities
1. Amazon Connect & Service Desk Reliability
- Infrastructure Management: Design, deploy, and maintain the Amazon Connect ecosystem, including Contact Flows, Lambda integrations, Lex bots, and claimed phone numbers, using Infrastructure as Code (Terraform/CloudFormation).
- Service Availability: Maintain the always-on state of the service desk. Manage voice and chat channel reliability, ensuring low latency and high audio quality.
- Integration Support: Oversee the reliability of integrations between Amazon Connect and ITSM tools (e.g., ServiceNow, Jira Service Management, or Salesforce).
- Capacity Planning: Proactively monitor and scale telephony quotas, concurrent tasks, and backend compute resources to handle peak service desk traffic.
2. Cloud Infrastructure & Security
- AWS Foundation: Manage core AWS services supporting the platform (EC2, ECS/EKS, S3, Lambda, DynamoDB, and VPC networking).
- Security & Compliance: Implement IAM least-privilege policies, encrypt data at rest and in transit (KMS), and ensure the platform meets industry standards (SOC 2, HIPAA, or PCI DSS, if applicable).
- Cost Optimization: Monitor cloud spend and implement FinOps practices to optimize Amazon Connect and infrastructure costs.
3. DevOps & CI/CD Pipeline Engineering
- Pipeline Ownership: Build and maintain robust CI/CD pipelines (GitLab CI, GitHub Actions, or Jenkins) to automate the deployment of Lambda functions, Lex bots, and infrastructure changes.
- Automated Testing: Integrate automated testing into the pipeline to validate contact flow logic and API integrations before they hit production.
- Reliability as Code: Standardize deployment patterns to ensure environment parity between Sandbox, Staging, and Production.
4. Observability & Incident Response
- Monitoring & Alerting: Develop comprehensive dashboards and alerts using CloudWatch, X-Ray, and third-party tools (Grafana, Datadog, or Splunk) to track SLIs.
- Incident Management: Lead troubleshooting for critical production outages. Conduct blameless post-mortems to identify root causes and prevent recurrence.
- Error Budgets: Define and manage Service Level Objectives (SLOs) and Error Budgets for the service desk platform.
Qualifications
Technical Skills:
- AWS Expertise: Deep knowledge of Amazon Connect (Contact Flows, CTRs, CCP customization) and general AWS services (Lambda, DynamoDB, S3, IAM).
- Infrastructure as Code (IaC): Proficient in Terraform (preferred), CloudFormation, or AWS CDK.
- CI/CD Tools: Experience building pipelines in GitLab, GitHub Actions, or AWS CodePipeline.
- Programming: Strong scripting skills in Python or Node.js (specifically for AWS Lambda development).
- Observability: Hands-on experience with Amazon CloudWatch, Kinesis (for stream analysis), and logging stacks (ELK or Splunk).
Experience & Education:
- 3+ years of experience in an SRE or DevOps role.
- 2+ years of hands-on experience specifically with Amazon Connect or similar CCaaS (Contact Center as a Service) platforms.
- Experience supporting high-volume Service Desk or Call Center environments.
- Preferred Certifications: AWS Certified DevOps Engineer - Professional or AWS Certified SysOps Administrator
Job ID: 147056627