Site Reliability (Site Reliability Engineering)

itcan pte. limited

Singapore, Cecil Street

3-5 Years

SGD 9,000 - 9,500 per month

Posted 22 hours ago

Job Description

Role Overview

We are seeking a highly skilled Site Reliability Engineer (SRE) to lead the reliability, scalability, and performance of our operations. You will be the primary owner of the AWS cloud infrastructure and the end-to-end DevOps pipelines. Your mission is to treat operations as a software problem, automating away manual toil and ensuring our AWS environment delivers a seamless experience for both agents and customers.

Key Responsibilities

1. Amazon Connect & Service Desk Reliability

- Infrastructure Management: Design, deploy, and maintain the Amazon Connect ecosystem, including Contact Flows, Lambda integrations, Lex bots, and claimed phone numbers, using Infrastructure as Code (Terraform/CloudFormation).

- Service Availability: Maintain the always-on state of the service desk. Manage voice and chat channel reliability, ensuring low latency and high audio quality.

- Integration Support: Oversee the reliability of integrations between Amazon Connect and ITSM tools (e.g., ServiceNow, Jira Service Management, or Salesforce).

- Capacity Planning: Proactively monitor and scale telephony quotas, concurrent tasks, and backend compute resources to handle peak service desk traffic.

2. Cloud Infrastructure & Security

- AWS Foundation: Manage core AWS services supporting the platform (EC2, ECS/EKS, S3, Lambda, DynamoDB, and VPC networking).

- Security & Compliance: Implement IAM least-privilege policies, encrypt data at rest and in transit (KMS), and ensure the platform meets applicable industry standards (SOC 2, HIPAA, or PCI-DSS).

- Cost Optimization: Monitor cloud spend and implement FinOps practices to optimize Amazon Connect and infrastructure costs.

3. DevOps & CI/CD Pipeline Engineering

- Pipeline Ownership: Build and maintain robust CI/CD pipelines (GitLab CI, GitHub Actions, or Jenkins) to automate the deployment of Lambda functions, Lex bots, and infrastructure changes.

- Automated Testing: Integrate automated testing into the pipeline to validate contact flow logic and API integrations before they hit production.

- Reliability as Code: Standardize deployment patterns to ensure environment parity between Sandbox, Staging, and Production.

4. Observability & Incident Response

- Monitoring & Alerting: Develop comprehensive dashboards and alerts using CloudWatch, X-Ray, and third-party tools (Grafana, Datadog, or Splunk) to track SLIs.

- Incident Management: Lead troubleshooting for critical production outages. Conduct blameless post-mortems to identify root causes and prevent recurrence.

- Error Budgets: Define and manage Service Level Objectives (SLOs) and Error Budgets for the service desk platform.

Qualifications

Technical Skills:

- AWS Expertise: Deep knowledge of Amazon Connect (Contact Flows, CTRs, CCP customization) and general AWS services (Lambda, DynamoDB, S3, IAM).

- Infrastructure as Code (IaC): Proficient in Terraform (preferred), CloudFormation, or AWS CDK.

- CI/CD Tools: Experience building pipelines in GitLab, GitHub Actions, or AWS CodePipeline.

- Programming: Strong scripting skills in Python or Node.js (specifically for AWS Lambda development).

- Observability: Hands-on experience with Amazon CloudWatch, Kinesis (for stream analysis), and logging stacks (ELK or Splunk).

Experience & Education:

- 3+ years of experience in an SRE or DevOps role.

- 2+ years of hands-on experience specifically with Amazon Connect or similar CCaaS (Contact Center as a Service) platforms.

- Experience supporting high-volume Service Desk or Call Center environments.