Watchdog Insights

Overview

Investigating an incident requires trial and error. Engineers familiar with a particular area draw on their experience to know where to look first for potential problems. Watchdog Insights helps all engineers, including less experienced ones, focus on the most important data and accelerate their incident investigations.

Throughout most of Datadog, Watchdog returns two types of insights:

  • Anomalies: All the pre-calculated Watchdog alerts matching the active search query that Watchdog found by scanning your organization’s data. Access the full list in the Watchdog Alert explorer.
  • Outliers: Calculated on the product data matching the active query, outliers surface tags that appear too frequently in some event types (for example, errors) or drive some continuous metrics upwards (for example, latency).
The log explorer showing the Watchdog Insights banner with five log anomalies

Explore insights

The Watchdog Insights carousel sits near the top of each supported product page.

Expand the carousel for an overview. The highest-priority insights (ranked by Insight type, State, Status, Start time, and Anomaly type) appear on the left.

The Watchdog Insights carousel on the Logs Explorer, showing three anomalies: new error logs in the web-store service, a spike in error logs in the product-recommendation service, and another spike in error logs in the product-recommendation service

Click View all to expand the panel. A side panel opens from the right, containing a vertical list of Watchdog Insights. Each entry shows a detailed view, with more information than the summary card.

Every outlier comes with embedded interactions and a side panel with troubleshooting information. Each Insight’s interactions and side panel vary based on the Watchdog Insight type.

Filter on Insight query

To refine your current view to match a Watchdog Insight, hover over the top right corner of an Insight summary card. Two icons appear. Click on the inverted triangle icon with the tooltip Filter on Insight. The page refreshes to show a list of entries corresponding to the insight.

Filtering the explorer on the insight context

Share an outlier

To share a given outlier, click on it in the insight panel to open the details side panel. Click the Copy Link button at the top of the details panel:

Outlier side panel showing how to copy the link

The link to the outlier expires with the retention of the underlying data. For instance, if the logs used to build the outlier are retained for 15 days, the link to the outlier expires with the logs after 15 days.
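The expiry arithmetic above can be sketched in a few lines. This is only an illustration, not Datadog code: the function name `link_expiry` and the sample date are assumptions made for the example.

```python
from datetime import date, timedelta


def link_expiry(data_date: date, retention_days: int) -> date:
    """Return the date on which data ingested on `data_date` ages out.

    A shared outlier link stops resolving once the underlying data is gone,
    so this is also the date on which the link expires.
    """
    return data_date + timedelta(days=retention_days)


# Logs ingested on March 1 with 15-day retention (hypothetical values).
print(link_expiry(date(2024, 3, 1), 15))  # 2024-03-16
```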

Outlier types

Error outliers

Error outliers display fields such as faceted tags or attributes containing characteristics of errors that match the current query. Statistically overrepresented key:value pairs among errors can hint at the root causes of problems.

Typical examples of error outliers include env:staging, docker_image:acme:3.1, and http.useragent_details.browser.family:curl.
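To make "statistically overrepresented" concrete, the sketch below compares how often a key:value pair appears among error logs with how often it appears among all logs matching a query. It is a simplified illustration and not Watchdog's actual algorithm, which is not described here. The log record shape (`status`, `tags`) and the function name `tag_overrepresentation` are assumptions made for the example.

```python
from collections import Counter


def tag_overrepresentation(logs):
    """Rank tags by how overrepresented they are among error logs.

    `logs` is a list of dicts with a "status" string and a "tags" list of
    "key:value" strings (a hypothetical record shape for this sketch).
    Returns (tag, share_of_errors, share_of_all_logs, ratio) tuples,
    sorted with the most overrepresented tag first.
    """
    errors = [log for log in logs if log["status"] == "error"]
    if not errors:
        return []

    all_counts = Counter(tag for log in logs for tag in set(log["tags"]))
    error_counts = Counter(tag for log in errors for tag in set(log["tags"]))

    rows = []
    for tag, n_errors in error_counts.items():
        error_share = n_errors / len(errors)
        overall_share = all_counts[tag] / len(logs)
        rows.append((tag, error_share, overall_share, error_share / overall_share))

    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


# Ten sample logs: env:staging carries 3 of the 4 errors but only 4 of 10 logs.
sample_logs = (
    [{"status": "error", "tags": ["env:staging"]}] * 3
    + [{"status": "info", "tags": ["env:staging"]}]
    + [{"status": "error", "tags": ["env:prod"]}]
    + [{"status": "info", "tags": ["env:prod"]}] * 5
)

for tag, error_share, overall_share, ratio in tag_overrepresentation(sample_logs):
    print(f"{tag}: {error_share:.0%} of errors, {overall_share:.0%} of all logs")
```

The two shares correspond to the two proportions shown on the banner card: the field's contribution to errors and its contribution to overall logs.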

In the banner card view, you can see:

  • The field name
  • The proportion of errors and overall logs that the field contributes to
The error outlier card showing a red bar with 73.3% of total errors and a blue bar with 8.31% of total errors

In the full side panel view, you can see:

  • The timeseries of error logs that contain the field
  • Tags that are often associated with the error logs
  • A comprehensive list of log patterns
Error Outlier side panel

APM outliers are available on all APM pages where the Watchdog Insights carousel is available.

Error outliers

Error outliers display fields such as tags containing characteristics of errors that match the current query. Statistically overrepresented key:value pairs among errors can hint at the root causes of problems.

Typical examples of error outliers include env:staging, availability_zone:us-east-1a, cluster_name:chinook, and version:v123456.

In the banner card view, you can see:

  • The field name
  • The proportion of errors and overall traces that the field contributes to
The error outlier card showing a red bar with 24.2% of total errors and a blue bar with 12.1% of total errors

In the full side panel view, you can see:

  • The timeseries of error traces that contain the field
  • Tags that are often associated with the error traces
  • A comprehensive list of related Error Tracking Issues and failing spans
Error Outlier side panel

Latency outliers

Latency outliers display fields such as tags that are associated with performance bottlenecks that match the current search query. Tags (key:value pairs) with worse performance than the baseline can hint at performance bottlenecks among a subset of APM spans.

Latency outliers are computed for the span duration.
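The comparison between a tag and its baseline can be sketched as follows. This is an illustration under stated assumptions, not Watchdog's implementation: the span record shape (`duration_ms`, `tags`), the nearest-rank percentile method, and the function names are all hypothetical.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (p between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_outlier(spans, tag, p=75):
    """Compare span duration at percentile `p` for one tag against the rest.

    `spans` is a list of dicts with "duration_ms" and a "tags" list
    (a hypothetical record shape for this sketch). Returns None when either
    side of the comparison has no spans.
    """
    with_tag = [s["duration_ms"] for s in spans if tag in s["tags"]]
    baseline = [s["duration_ms"] for s in spans if tag not in s["tags"]]
    if not with_tag or not baseline:
        return None

    tag_value = percentile(with_tag, p)
    baseline_value = percentile(baseline, p)
    return {
        "tag": tag,
        "percentile": p,
        "tag_ms": tag_value,
        "baseline_ms": baseline_value,
        "diff_ms": tag_value - baseline_value,
    }


sample_spans = (
    [{"duration_ms": d, "tags": ["version:v2"]} for d in (100, 200, 300, 400)]
    + [{"duration_ms": d, "tags": ["version:v1"]} for d in (10, 20, 30, 40)]
)

print(latency_outlier(sample_spans, "version:v2"))
```

The returned percentile value and its difference from the baseline mirror the "percentile of interest" figures shown on the banner card.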

In the banner card view, you can see:

  • The field name
  • The latency distribution for spans containing the tag and the baseline for the rest of the data
  • A percentile of interest latency value for the outlier tag and the difference with the baseline for the rest of the data
Latency Outlier banner card

In the full side panel, you can see a latency distribution graph for the tag and the baseline, with an X axis in increments of p50, p75, p99, and max. The panel also lists the APM events that contain the field.

Latency Outlier full side panel view

Lock contention outlier

In the banner card view, you can see:

  • The name of the impacted service
  • The number of threads impacted
  • The potential CPU savings (and estimated cost savings)
Profiling insight on Lock Contention

In the full side panel, you can see instructions on how to resolve the lock contention:

Side panel with all the information on how to address the Lock Contention outlier

Garbage collection outlier

In the banner card view, you can see:

  • The name of the impacted service
  • The amount of CPU time used to perform garbage collection
Profiling insight on Garbage Collection

In the full side panel, you can see instructions on how to better configure garbage collection to free up some CPU time:

Side panel with all the information on how to address the Garbage Collection outlier

Regex compilation outlier

In the banner card view, you can see:

  • The name of the impacted service
  • The amount of CPU time spent on compiling regexes
Profiling insight on Regex Compilation

In the full side panel, you can see instructions on how to improve regex compilation time, as well as examples of functions within your code that could be improved:

Side panel with all the information on how to address the Regex Compilation outlier

For Database Monitoring, Watchdog surfaces insights on the following metrics:

  • CPU
  • Commits
  • IO
  • Background
  • Concurrency
  • Idle

Find the databases impacted by one or multiple outliers by using the Insight carousel.

Carousel to filter the Databases with Insights

An overlay is then set on the databases, with pink pills highlighting the different Insights and giving more information about what happened.

Watchdog insight overlay on the database to highlight what is happening

Error outlier

Error outliers display fields such as faceted tags or attributes that contain characteristics of errors that match the current search query. Statistically overrepresented key:value pairs among errors can hint at the root causes of issues. Typical examples of error outliers include env:staging, version:1234, and browser.name:Chrome.

In the banner card view, you can see:

  • The field name
  • The proportion of total errors and overall RUM events that the field contributes to
  • Related tags

In the full side panel, you can see a timeseries graph of the total number of RUM errors with the field, along with impact pie charts and a list of RUM events that contain the field.

Error Outlier full side panel

Latency outlier

Latency outliers display fields such as faceted tags or attributes that are associated with performance bottlenecks that match the current search query. Tags or attributes (key:value pairs) with worse performance than the baseline can hint at performance bottlenecks among a subset of real users.

Latency outliers are computed for Core Web Vitals and related page performance metrics, such as First Contentful Paint, First Input Delay, Cumulative Layout Shift, and Loading Time. For more information, see Monitoring Page Performance.

In the banner card view, you can see:

  • The field name
  • The performance metric value containing the field and the baseline for the rest of the data

In the full side panel, you can see a timeseries graph of the performance metric, with an X axis in increments of p50, p75, p99, and max. The panel also lists the RUM events that contain the field.

Latency Outlier full side panel view

For serverless infrastructures, Watchdog surfaces the following insights:

  • Cold Start Ratio Up/Down
  • Error Invocation Ratio Up/Down
  • Memory Usage Up/Down
  • OOM Ratio Up/Down
  • Estimated Cost Up/Down
  • Init Duration Up/Down
  • Runtime Duration Up/Down
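As an illustration of what an "Up/Down" ratio insight describes, the sketch below computes a ratio (for example, cold starts divided by invocations) over two time windows and reports the direction of change. It is a simplified assumption-laden example: the function name `ratio_change` and the sample counts are hypothetical, and Watchdog's own detection logic, including how it decides that a change is significant, is not described here.

```python
def ratio_change(current_hits, current_total, previous_hits, previous_total):
    """Compare a ratio between two time windows.

    Returns (current_ratio, previous_ratio, direction), where direction is
    "Up", "Down", or "Flat". No significance test is applied in this sketch.
    """
    if current_total <= 0 or previous_total <= 0:
        raise ValueError("each window needs at least one invocation")

    current = current_hits / current_total
    previous = previous_hits / previous_total

    if current > previous:
        direction = "Up"
    elif current < previous:
        direction = "Down"
    else:
        direction = "Flat"
    return current, previous, direction


# Hypothetical counts: 30 cold starts in 200 invocations now,
# compared with 10 cold starts in 200 invocations previously.
current, previous, direction = ratio_change(30, 200, 10, 200)
print(f"Cold start ratio {direction}: {previous:.0%} -> {current:.0%}")
```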

Find the serverless functions impacted by one or multiple outliers by using the Insights carousel.

Facet to filter the Serverless Functions with insights

An overlay is then set on the function, with pink pills highlighting the different insights and giving more information about what happened.

Watchdog insight overlay on the function to highlight what is happening

For Process Explorer, the Watchdog Insight carousel reflects all Process anomalies for the current context of the Process Explorer.

For Kubernetes Explorer, the Watchdog Insight carousel reflects all the Kubernetes anomalies for the current context of the Kubernetes Explorer.
