Watchdog is an algorithmic feature for APM performance and infrastructure metrics that automatically detects potential application and infrastructure issues. It leverages the same seasonal algorithms that power anomaly detection and dashboards. Watchdog observes trends and patterns in:
- APM metrics
- Infrastructure metrics from integrations
Watchdog looks for irregularities in metrics, like a sudden spike in the hit rate. For each irregularity, the Watchdog page displays a Watchdog story. Each story includes a graph of the detected metric irregularity and gives more information about the relevant timeframe and endpoint or endpoints. To avoid false alarms, Watchdog only reports issues after observing your data for a sufficient amount of time to establish a high degree of confidence. The minimum amount of data needed to see irregularities depends on the anomaly you are looking at and can range from four days to two weeks.
Watchdog RCA enables APM customers to identify causal relationships between different symptoms across your applications and infrastructure. This information helps you speed up your root cause analysis and reduce your mean time to recovery (MTTR).
Watchdog can group related data together, draw connections between groups, and prioritize the most important areas to focus on.
Watchdog considers the relationships between the following types of signals:
Watchdog also correlates signals and anomalies from the different parts of your infrastructure (logs, traces, and metrics) and adds them as evidence to each RCA story. To enable this, it is recommended that you set up unified service tagging across your telemetry.
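Unified service tagging ties telemetry together through the standard `env`, `service`, and `version` tags. As a minimal sketch, assuming a service configured through environment variables, the reserved `DD_ENV`, `DD_SERVICE`, and `DD_VERSION` variables set these tags for the tracer and Agent (the values shown are illustrative):

```shell
# Reserved environment variables for unified service tagging.
# The values (prod, web-store, 1.2.3) are placeholders for your own.
export DD_ENV="prod"          # becomes the env: tag
export DD_SERVICE="web-store" # becomes the service: tag
export DD_VERSION="1.2.3"     # becomes the version: tag
```

With the same three tags applied to logs, traces, and metrics, correlated signals from one service line up under one set of tags.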
Clicking on the story shows further details about the detected irregularity:
The graph in this story shows the latency values of the ELB in three different availability zones. Watchdog detected similar anomalies in this metric from a single load balancer enabled in three availability zones, and automatically grouped these findings together in a single story. After a period of consistently low latency, the metric in all three AZs rises sharply in the highlighted area of the graph, which indicates the timeframe of the anomaly.
Selecting Show expected bounds in the upper-right corner reveals upper and lower thresholds of expected behavior on the graph.
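Watchdog's seasonal algorithms are proprietary, but the idea behind expected bounds can be sketched with a much simpler stand-in: a trailing rolling mean plus or minus a few standard deviations, with points outside that band flagged as irregular. The function names and parameters below are illustrative, not Datadog's implementation:

```python
from statistics import mean, stdev

def expected_bounds(values, window=5, k=3.0):
    """Illustrative upper/lower bounds of expected behavior.

    For each point, compute a trailing rolling mean over `window`
    prior samples and return (mean - k*std, mean + k*std). Watchdog's
    real seasonal algorithms are considerably more sophisticated.
    """
    bounds = []
    for i in range(len(values)):
        hist = values[max(0, i - window):i] or values[:1]
        mu = mean(hist)
        sigma = stdev(hist) if len(hist) > 1 else 0.0
        bounds.append((mu - k * sigma, mu + k * sigma))
    return bounds

def is_anomalous(value, bound):
    """A point is irregular if it falls outside the expected band."""
    lo, hi = bound
    return value < lo or value > hi
```

For a latency series that is flat and then spikes, such as `[100, 101, 99, 100, 102, 100, 250]`, the final point lands well above its upper bound while the steady points stay inside the band.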
Use the folder icon in the upper-right corner of a story to archive it. Archiving hides the story from the feed, as well as other places in the Datadog application, like the home page. If a story is archived, the yellow Watchdog binoculars icon does not show up next to the relevant service or resource.
To see archived stories, select the Show N archived stories checkbox in the top left. You can also see who archived each story and when, and restore archived stories to your feed.
Note: Archiving does not prevent Watchdog from flagging future issues related to the service or resource.
When an anomaly appears in one service, there’s often a corresponding anomaly in a related service. For example, if one service’s database queries get throttled, any downstream service will experience elevated latency. You need to troubleshoot this not as two separate issues, but rather as one issue stemming from a single root cause.
Watchdog automatically groups related APM anomalies into a single story whenever it detects an issue that affects multiple services. The story includes a dependency map showing the service where the issue originated and the downstream dependencies that were affected. This gives you visibility into the impact of the issue and a quick path to its source, so you can move on to resolution.
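As a rough illustration of grouping on a dependency map (not Datadog's actual algorithm), the likely origin of a multi-service issue is an anomalous service whose own dependencies are all healthy; anomalous services that call it are treated as downstream impact. The service names used are hypothetical:

```python
def find_origins(dependencies, anomalous):
    """Return likely origin services for a group of related anomalies.

    dependencies: dict mapping each service to the services it calls.
    anomalous: set of services currently showing an anomaly.
    A service is a likely origin if it is anomalous and none of the
    services it depends on are anomalous themselves.
    """
    origins = set()
    for svc in anomalous:
        callees = dependencies.get(svc, [])
        if not any(dep in anomalous for dep in callees):
            origins.add(svc)
    return origins

# Hypothetical topology: web-store depends on payments-db.
deps = {"web-store": ["payments-db"], "payments-db": []}
```

If both services are anomalous, `find_origins(deps, {"web-store", "payments-db"})` points at `payments-db` as the origin and leaves `web-store` as an affected downstream consumer.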
To speed up further investigation, Datadog may suggest some of your dashboards that are related to the story. When it does, Datadog highlights which of the dashboard's metrics relate to the insights in the story.
Monitors associated with your stories are displayed at the bottom. Each displayed monitor includes the story's metric and its associated tags in its scope.
Additionally, Watchdog suggests one or more monitors configured to trigger if the story happens again. Click the Enable Monitor button to enable them for your organization. See the Watchdog monitor documentation to learn how to create a Watchdog monitor.
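Suggested monitors can also be recreated programmatically through Datadog's Monitors API (`POST /api/v1/monitor`). As a sketch, here is an illustrative metric-alert request body for the ELB latency scenario above; the monitor name, query, threshold, and notification handle are placeholders, and a true Watchdog monitor is easiest to create from the UI or the Watchdog monitor documentation:

```python
import json

def build_metric_monitor(name, query, message):
    """Build a request body for Datadog's create-monitor endpoint
    (POST /api/v1/monitor). All values passed in are illustrative."""
    return {
        "name": name,
        "type": "metric alert",
        "query": query,
        "message": message,
        "options": {"notify_no_data": False},
    }

payload = build_metric_monitor(
    name="Elevated ELB latency",
    query="avg(last_5m):avg:aws.elb.latency{*} > 1",
    message="ELB latency is above 1s. @slack-ops",  # placeholder handle
)
# Serialize and send with your DD-API-KEY / DD-APPLICATION-KEY headers.
body = json.dumps(payload)
```

Sending the request requires valid API and application keys; the builder itself only assembles the documented fields.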
You can use the time range, search bar, or facets to filter your Watchdog stories:
Use the time range selector in the upper right to view stories detected in a specific time range. You can view any story from the last 13 months; stories are not available before March 2019.
Typing in the Filter stories search box enables you to search over your story titles.
Facets are associated with your Watchdog stories, allowing you to filter them by:
| Facet | Description |
|---|---|
| Story Category | Display all stories, or only those from a selected category. |
| Story Type | Which metric stories, from APM or infrastructure integrations, should be displayed. |
| APM Environment | The APM environment to display stories from. |
| APM Primary Tag | The defined APM primary tag to display stories from. |
| APM Service | The APM service to display stories from. |
When an irregularity in a metric is detected, the yellow Watchdog binoculars icon appears next to the affected service in the APM services list. The number next to the binoculars indicates the number of issues Watchdog has noticed within that service.
If Watchdog has discovered something out of the ordinary in a specific service, viewing the corresponding Service page reveals a dedicated Watchdog section in the middle of the page, between the application performance graphs and the latency distribution section. The Watchdog section displays any relevant Watchdog Stories.
When Watchdog RCA detects an anomaly with your application, it creates a story and links it with any user-defined monitors that have triggered. The Watchdog story is visible directly at the top of the triggered monitor page.
Need help? Contact Datadog support.
Additional helpful documentation, links, and articles: