- Provide support for escalated issues in the following areas:
- Infrastructure Monitoring: Assist users with setup and configuration of monitoring tools, interpret alerts, and ensure proper installation of agents across platforms (AWS, Azure, on-premises).
- Logs Monitoring: Support users in accessing log management features, answer queries on ingestion and retention policies, and guide them in navigating the logging interface.
- Application Performance Monitoring (APM): Help users with APM features, including setup, performance thresholds, and access to metrics, without performing deep data analysis.
- Real User Monitoring (RUM): Assist users in implementing RUM features, addressing questions on user experience metrics and monitoring specific journeys.
- Synthetic Monitoring: Guide users in creating and managing synthetic tests, configuring scenarios, scheduling, and troubleshooting common issues.
- Apply SRE principles to continuously improve the supportability and maintainability of the observability platform, focusing on reducing operational overhead and enhancing system reliability.
- Dashboarding: Provide guidance to users in creating dashboards and using various widgets to optimize the display of critical information.
- Documentation: Maintain comprehensive documentation of Datadog configurations, best practices, and procedures.
- Collaboration: Work closely with application and infrastructure teams to ensure monitoring requirements are met and support troubleshooting efforts.
- Training and Support: Provide training and support to team members on Datadog usage and best practices.
What qualifications or skills should you possess in this role?
- Experience: At least 5 years of experience as an Observability Engineer or in a similar IT operations role, including 2-3 years of experience using observability tools such as Datadog, Dynatrace, AppDynamics, Splunk, or New Relic. Preference will be given to candidates with Datadog experience.
- Technical Skills: Strong knowledge of an observability platform (such as Datadog), including infrastructure monitoring, log monitoring, application performance monitoring (APM), synthetic monitoring, and real user monitoring (RUM). Familiarity with cloud platforms (AWS, Azure) and containerization technologies (Kubernetes).
- Scripting: Proficiency in scripting languages (e.g., Python) for automation and integration tasks.
- Problem-Solving: Excellent analytical and problem-solving skills with the ability to troubleshoot complex issues.
- Communication: Strong verbal and written communication skills, with the ability to collaborate effectively with cross-functional teams.
- CI/CD: Familiarity with continuous integration and continuous deployment (CI/CD) pipelines and tools (e.g., GitLab, GitHub).
- Certifications: Relevant certifications (e.g., Datadog Certified, AWS Certified) are a plus.