Watchdog is an algorithmic feature for APM performance and infrastructure metrics that automatically detects potential application and infrastructure issues. It leverages the same seasonal algorithms that power anomalies and dashboards. Watchdog observes trends and patterns in:
Watchdog looks for irregularities in metrics, like a sudden spike in the hit rate. For each irregularity, the Watchdog page displays a Watchdog story. Each story includes a graph of the detected metric irregularity and gives more information about the relevant timeframe and endpoint or endpoints. To avoid false alarms, Watchdog only reports issues after observing your data for a sufficient amount of time to establish a high degree of confidence. The minimum amount of data needed to see irregularities depends on the anomaly you are looking at and can range from four days to two weeks.
Root Cause Analysis for APM (beta)
Watchdog Root Cause Analysis (RCA) is currently in beta. Use this form to request access.
Watchdog RCA helps APM customers identify causal relationships between different symptoms across their applications and infrastructure. This information speeds up root cause analysis and reduces mean time to recovery (MTTR).
Watchdog can group related data together, draw connections between groups, and prioritize the most important areas to focus on.
Watchdog considers the relationships between the following types of signals:
- APM error rate, latency, and hit rate increases
- New deployments with APM service version changes
- APM error traces
- Introduction of new APM resources
- Changes to traced database queries
- Agent-based infrastructure metrics (high CPU usage, high memory usage, high disk usage, unreachable hosts, etc.)
- Error log pattern anomalies
- Triggered alerts from your own monitors
Watchdog also correlates signals and anomalies from the different parts of your infrastructure (logs, traces, and metrics) and adds them as evidence to each RCA story. To enable this, it is recommended that you set up unified tagging across your telemetry.
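As a rough illustration of what unified tagging means in practice, the sketch below checks whether the tags attached to each telemetry source carry the reserved `env`, `service`, and `version` keys that Datadog's unified service tagging is built around. The helper name `missing_unified_tags` and the sample tag lists are hypothetical and not part of any Datadog library.

```python
# Reserved tag keys used by unified service tagging.
REQUIRED_KEYS = ("env", "service", "version")

def missing_unified_tags(tags):
    """Return the required tag keys that are absent from a list of 'key:value' tags."""
    present = {tag.split(":", 1)[0] for tag in tags if ":" in tag}
    return [key for key in REQUIRED_KEYS if key not in present]

# Hypothetical tags as they might appear on each kind of telemetry.
telemetry_tags = {
    "traces": ["env:prod", "service:web-store", "version:1.4.2"],
    "logs": ["env:prod", "service:web-store"],
    "metrics": ["env:prod", "service:web-store", "version:1.4.2"],
}

for source, tags in telemetry_tags.items():
    missing = missing_unified_tags(tags)
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"{source}: {status}")
```

In this sample, the logs lack a `version` tag, so they could not be tied to a specific deployment the way the traces and metrics can.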
Clicking on the story shows further details about the detected irregularity:
The graph in this story shows the latency values of the ELB in three different availability zones. Watchdog detected similar anomalies in this metric from a single load balancer enabled in three availability zones, and automatically grouped these findings together in a single story. After a period of consistently low latency, the metric in all three AZs rises sharply in the highlighted area of the graph, which indicates the timeframe of the anomaly.
Selecting Show expected bounds in the upper-right corner reveals upper and lower thresholds of expected behavior on the graph.
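To make the idea of expected bounds concrete, here is a minimal sketch that derives an upper and lower threshold from past values and flags recent points that fall outside them. It uses a simple mean plus or minus a multiple of the standard deviation as a stand-in. It is not the seasonal algorithm Watchdog uses, and the function names, the `width` parameter, and the sample data are all assumptions made for illustration.

```python
from statistics import mean, stdev

def expected_bounds(history, width=3.0):
    """Return (lower, upper) bounds: mean of history +/- width * sample stdev."""
    center = mean(history)
    spread = stdev(history)
    return center - width * spread, center + width * spread

def find_anomalies(history, recent, width=3.0):
    """Return the indices of recent points that fall outside the expected bounds."""
    lower, upper = expected_bounds(history, width)
    return [i for i, value in enumerate(recent) if value < lower or value > upper]

# Hypothetical latency samples in milliseconds.
history = [20, 22, 21, 19, 20, 23, 21, 20, 22, 21]
recent = [21, 20, 85, 90, 22]

print(find_anomalies(history, recent))  # → [2, 3]
```

A static band like this ignores daily or weekly patterns, which is why a real detector needs enough history to learn what normal looks like at each time of day.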
Use the folder icon in the upper-right corner of a story to archive it. Archiving hides the story from the feed, as well as other places in the Datadog application, like the home page. If a story is archived, the yellow Watchdog binoculars icon does not show up next to the relevant service or resource.
To see archived stories, select the checkbox option to “Show N archived stories” in the top left. You can also see who archived each story and when, and restore archived stories to your feed.
Note: Archiving does not prevent Watchdog from flagging future issues related to the service or resource.
When an anomaly appears in one service, there’s often a corresponding anomaly in a related service. For example, if one service’s database queries get throttled, any downstream service will experience elevated latency. You need to troubleshoot this not as two separate issues, but rather as one issue stemming from a single root cause.
Watchdog automatically groups related APM anomalies into a single story whenever it detects an issue that affects multiple services. The story includes a dependency map that shows the service where the issue originated and the downstream dependencies that were affected. This gives you visibility into the impact of the issue and a quick path to its source, so you can move on to resolution.
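The grouping described above can be pictured with a small sketch over a service dependency map. Given which services call which, and the set of services currently showing anomalies, it picks out the anomalous services that have no anomalous dependency of their own (likely origins) and the anomalous services that depend on them. This is a simplified heuristic for illustration only, not Watchdog's actual method. The function names, the `calls` structure, and the service names are hypothetical.

```python
def find_origin_services(calls, anomalous):
    """Return anomalous services that do not depend on any other anomalous service.

    `calls` maps each service to the list of services it calls.
    """
    origins = []
    for service in sorted(anomalous):
        deps = calls.get(service, [])
        if not any(dep in anomalous for dep in deps):
            origins.append(service)
    return origins

def affected_dependents(calls, origin, anomalous):
    """Return anomalous services that reach `origin` through their call chain."""
    affected = []
    for service in sorted(anomalous):
        if service == origin:
            continue
        stack, seen = list(calls.get(service, [])), set()
        while stack:
            dep = stack.pop()
            if dep == origin:
                affected.append(service)
                break
            if dep not in seen:
                seen.add(dep)
                stack.extend(calls.get(dep, []))
    return affected

# Hypothetical dependency map: web calls checkout, which calls payments-db.
calls = {
    "web": ["checkout"],
    "checkout": ["payments-db"],
    "payments-db": [],
    "search": [],
}
anomalous = {"web", "checkout", "payments-db"}

origins = find_origin_services(calls, anomalous)
print(origins)                                            # → ['payments-db']
print(affected_dependents(calls, origins[0], anomalous))  # → ['checkout', 'web']
```

Treating the three anomalies as one story rooted at `payments-db` mirrors the throttled-database example: the latency in `checkout` and `web` is a symptom, not a separate issue.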
To speed up further investigations, Datadog may suggest some of your dashboards that are related to the story. In this case, Datadog will highlight which of the dashboard’s metrics are related to the insights in the story.
Monitors associated with your stories are displayed at the bottom. Each monitor displayed has the metric of the current story and its associated tags included in its scope.
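One literal reading of this matching rule is sketched below: a monitor is associated with a story when it watches the same metric and the story's tags are all contained in the monitor's scope. The dictionary layout, metric names, and monitor names are hypothetical, and the real matching logic may differ.

```python
def monitors_for_story(story, monitors):
    """Return names of monitors on the story's metric whose scope contains all story tags."""
    story_tags = set(story["tags"])
    return [
        monitor["name"]
        for monitor in monitors
        if monitor["metric"] == story["metric"] and story_tags <= set(monitor["scope"])
    ]

# Hypothetical story and monitors.
story = {"metric": "web.request.latency", "tags": ["env:prod", "service:web"]}
monitors = [
    {"name": "Web latency (prod)", "metric": "web.request.latency",
     "scope": ["env:prod", "service:web", "team:storefront"]},
    {"name": "Web latency (staging)", "metric": "web.request.latency",
     "scope": ["env:staging", "service:web"]},
    {"name": "Web errors (prod)", "metric": "web.request.errors",
     "scope": ["env:prod", "service:web"]},
]

print(monitors_for_story(story, monitors))  # → ['Web latency (prod)']
```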
Additionally, Watchdog suggests one or more monitors that are configured to trigger if the story happens again. Click the Enable Monitor button to enable them for your organization. See the Watchdog monitor documentation to learn how to create a Watchdog monitor.
Watchdog Impact Analysis
Whenever Watchdog finds a new APM anomaly, it simultaneously analyzes a variety of latency and error metrics that are submitted from the RUM SDKs to evaluate if the anomaly is adversely impacting any web or mobile pages visited by your users.
If Watchdog determines that the end-user experience is impacted, it provides a summary of the impacts in the Watchdog APM alert. This includes:
- A list of impacted RUM views
- An estimated number of impacted users
- A link to the list of impacted users, so that you can reach out to them if needed
This feature is automatically enabled for all APM and RUM users. Whenever Watchdog APM alerts are associated with end-user impacts, affected users and view paths appear in the Impacts section of your Watchdog stories. Click users to view the affected users’ contact information if you need to reach out to them. Click view paths to access the impacted RUM views for additional information.
You can use the time range, search bar, or facets to filter your Watchdog stories:
Use the time range selector in the upper right to view stories detected in a specific time range. You can view any story that happened in the last 13 months, going back to March 2019.
Type in the Filter stories search box to search your story titles.
Facets are associated with your Watchdog stories, allowing you to filter them by:
- apm or all
- Which metrics from APM or infrastructure integrations stories should be displayed.
- APM Environment: The APM Environment to display stories from.
- APM Primary Tag: The defined APM primary tag to display stories from.
- APM Service: The APM Service to display stories from.
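The three filters described above (time range, title search, and facets) combine as successive narrowing steps. The sketch below models that behavior on an in-memory list. The `Story` class, the `filter_stories` function, and the facet keys `environment` and `service` are hypothetical names chosen for illustration and do not correspond to a Datadog API.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Story:
    title: str
    detected_at: datetime
    facets: dict = field(default_factory=dict)

def filter_stories(stories, start, end, query="", facets=None):
    """Keep stories inside [start, end] whose title contains `query`
    (case-insensitive) and whose facets match every requested value."""
    facets = facets or {}
    query = query.lower()
    result = []
    for story in stories:
        if not (start <= story.detected_at <= end):
            continue  # outside the selected time range
        if query and query not in story.title.lower():
            continue  # title does not match the search text
        if any(story.facets.get(key) != value for key, value in facets.items()):
            continue  # at least one facet differs
        result.append(story)
    return result

# Hypothetical stories.
stories = [
    Story("Latency increase on checkout", datetime(2021, 3, 2),
          {"environment": "prod", "service": "checkout"}),
    Story("Error rate increase on web", datetime(2021, 3, 5),
          {"environment": "prod", "service": "web"}),
    Story("Latency increase on web", datetime(2021, 1, 10),
          {"environment": "staging", "service": "web"}),
]

matches = filter_stories(
    stories,
    start=datetime(2021, 3, 1),
    end=datetime(2021, 3, 31),
    query="latency",
    facets={"environment": "prod"},
)
print([story.title for story in matches])  # → ['Latency increase on checkout']
```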
Watchdog in the services list
When an irregularity in a metric is detected, the yellow Watchdog binoculars icon appears next to the affected service in the APM services list. The number next to the binoculars indicates the number of issues Watchdog has noticed within that service.
If Watchdog has discovered something out of the ordinary in a specific service, viewing the corresponding Service page reveals a dedicated Watchdog section in the middle of the page, between the application performance graphs and the latency distribution section. The Watchdog section displays any relevant Watchdog Stories.
Watchdog with alerts
When Watchdog RCA detects an anomaly in your application, it creates a story and links it with any user-defined monitors that have triggered. The Watchdog story is visible at the top of the triggered monitor page.
Need help? Contact Datadog support.