Troubleshooting Monitor Alerts


Overview

This guide provides an overview of some foundational concepts that can help you determine if your monitor’s alerting behavior is valid. If you suspect that your monitor’s evaluations are not accurately reflecting the underlying data, use this guide to inspect your monitor and troubleshoot the following:

Monitor state or status that does not match the evaluation
Missing or sparse underlying data
Misconfigured alert conditions
Unwanted or absent notifications

Monitor state and monitor status

Monitor evaluations are stateless: the result of a given evaluation does not depend on the results of previous evaluations. Monitors themselves are stateful, and their state is updated based on the evaluation results of their queries and configurations. A monitor evaluation with a given status does not necessarily change the monitor’s state to the same status. Some potential causes are described below:

Metrics are too sparse within a metric monitor’s evaluation window

If metrics are absent from a monitor’s evaluation window, and the monitor is not configured to anticipate no-data conditions, the evaluation may be skipped. In that case the monitor state is not updated, so a monitor previously in the OK state remains OK, and a monitor in the Alert state remains in Alert. To check whether this is happening, open the history graph on the monitor status page and select the group and time frame of interest. If data is sparsely populated, see monitor arithmetic and sparse metrics for more information.
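The difference between a stateless evaluation and a stateful monitor can be sketched in a few lines of Python. This is only an illustrative model, not Datadog's implementation: the function names, the averaging rule, and the threshold value are all assumptions made for the example.

```python
from typing import Optional


def evaluate(points: list[float], threshold: float) -> Optional[str]:
    """Stateless evaluation: depends only on the points in the current window."""
    if not points:
        # Assumed behavior for this sketch: no data means the evaluation is skipped.
        return None
    average = sum(points) / len(points)
    return "Alert" if average > threshold else "OK"


def next_state(current_state: str, evaluation: Optional[str]) -> str:
    """Stateful update: a skipped evaluation leaves the monitor state unchanged."""
    return current_state if evaluation is None else evaluation


state = "OK"
windows = [[10.0, 12.0], [95.0, 97.0], [], [20.0]]  # third window has no data
history = []
for window in windows:
    state = next_state(state, evaluate(window, threshold=90.0))
    history.append(state)

print(history)  # ['OK', 'Alert', 'Alert', 'OK']
```

The third window contains no points, so the monitor stays in Alert even though nothing in that window exceeded the threshold.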

Monitor state updates due to external conditions

The state of a monitor can also update without a monitor evaluation, for example, due to auto-resolve.

“No Data” status when using the rollup function

If your monitors unexpectedly evaluate to a “No Data” status, review your settings for rollups and evaluation windows. For instance, a monitor with a 4-minute rollup and a 20-minute evaluation window produces one data point every 4 minutes, so the window holds at most 5 data points. If the “Require Full Window” option is enabled, the evaluation may result in “No Data” because the window is not fully populated.

For most use cases, disable the “Require Full Window” setting unless your specific scenario demands complete data for accurate evaluation. For more information, see Rollups in monitors.
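The rollup arithmetic above can be sketched as follows. The fullness check is a deliberate simplification and an assumption for illustration; the exact rule the “Require Full Window” option applies is not described in this guide, and the function names are hypothetical.

```python
def max_points_in_window(window_minutes: int, rollup_minutes: int) -> int:
    """Upper bound on data points: one point per rollup interval."""
    return window_minutes // rollup_minutes


def evaluation_result(points_received: int, window_minutes: int,
                      rollup_minutes: int, require_full_window: bool) -> str:
    """Assumed rule: a window with fewer points than the maximum is not full."""
    expected = max_points_in_window(window_minutes, rollup_minutes)
    if require_full_window and points_received < expected:
        return "No Data"
    return "Evaluated"


print(max_points_in_window(20, 4))            # 5
print(evaluation_result(4, 20, 4, True))      # No Data
print(evaluation_result(4, 20, 4, False))     # Evaluated
```

With only 5 possible points in the window, a single missing point is enough to leave the window incomplete in this model, which is why disabling the option is usually the safer default.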

Verify the presence of data

If your monitor’s state or status is not what you expect, confirm the behavior of the underlying data source. For a metric monitor, you can use the history graph to view the data points being pulled in by the metric query. N/A groups are not included in monitors but are visible in dashboard queries.

Alert conditions

Unexpected monitor behavior can sometimes be the result of misconfigured alert conditions, which vary by monitor type. If your monitor query uses the as_count() function, check the as_count() in Monitor Evaluations guide.

If you use recovery thresholds, check the conditions listed in the recovery thresholds guide to see whether the behavior is expected.

Monitor status and groups

For both monitor evaluations and state, status is tracked by group.

For a multi alert monitor, a group is a set of tags with one value for each grouping key (for example, env:dev, host:myhost for a monitor grouped by env and host). For a simple alert, there is only one group (*), representing everything within the monitor’s scope.
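As a rough illustration of how groups follow from grouping keys, the sketch below derives a group name from a series' tags. The `group_key` helper is hypothetical, and dropping series that lack a grouping key is a simplification of the note above that N/A groups are not included in monitors.

```python
from typing import Optional


def group_key(tags: dict[str, str], group_by: list[str]) -> Optional[str]:
    """Return the group a series belongs to, or None if it has no valid group."""
    if not group_by:
        return "*"  # simple alert: one group covering the whole scope
    if any(key not in tags for key in group_by):
        return None  # simplified: series missing a grouping key is left out
    return ", ".join(f"{key}:{tags[key]}" for key in group_by)


series = [
    {"env": "dev", "host": "myhost"},
    {"env": "dev", "host": "otherhost"},
    {"env": "prod"},  # no host tag
]

groups = sorted(
    {key for tags in series
     if (key := group_key(tags, ["env", "host"])) is not None}
)
print(groups)  # ['env:dev, host:myhost', 'env:dev, host:otherhost']
print(group_key(series[0], []))  # *
```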

By default, Datadog keeps monitor groups available in the UI for 24 hours, or 48 hours for host monitors, unless the query is changed. See Monitor settings changes not taking effect for more information.

If you anticipate creating new monitor groups within the scope of your multi alert monitors, you may want to configure a delay for the evaluation of these new groups. This can help you avoid alerts from the expected behavior of new groups, such as high resource usage associated with the creation of a new container. Read new group delay for more information.

If your monitor queries for crawler-based cloud metrics, use an evaluation delay to ensure that the metrics have arrived before the monitor evaluates. Read cloud metric delay for more information about cloud integration crawler schedules.
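The two delays described above might be expressed in a monitor definition like the one below. Treat this as a sketch under assumptions: the option names `evaluation_delay`, `new_group_delay`, and `require_full_window` follow the monitor API's options object as commonly documented, with both delays in seconds, but verify them against the current API reference. The monitor name, query, and delay values are hypothetical and chosen only for illustration.

```python
import json


def minutes(n: int) -> int:
    """Convert minutes to seconds, the unit assumed for both delay options."""
    return n * 60


monitor = {
    "name": "High CPU on EC2 hosts (hypothetical example)",
    "type": "query alert",
    "query": "avg(last_15m):avg:aws.ec2.cpuutilization{*} by {host} > 90",
    "message": "CPU is high on {{host.name}}.",
    "options": {
        # Wait for crawler-based cloud metrics to arrive before evaluating.
        "evaluation_delay": minutes(15),
        # Hold off evaluating newly created groups, such as new hosts.
        "new_group_delay": minutes(5),
        # See the rollup section: usually left disabled.
        "require_full_window": False,
    },
}

print(json.dumps(monitor["options"], indent=2))
```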

Notification issues

If your monitor is behaving as expected, but producing unwanted notifications, there are multiple options to reduce or suppress notifications:

For monitors that rapidly change between states, read reduce alert flapping for ways to minimize alert fatigue.
For alerts that are expected or otherwise not useful to your organization, use Downtimes to suppress unwanted notifications.
To control alert routing, use template variables, and use conditional variables to separate warning and alert states.
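As an example of routing with conditional variables, the notification message below sends alert and warning states to different recipients. The `{{#is_alert}}` and `{{#is_warning}}` blocks are conditional variables; the two `@` handles are hypothetical placeholders, and `{{host.name}}` assumes the monitor is grouped by host. The small checker is not part of any Datadog tooling; it is just a quick way to catch an unclosed block before saving a message.

```python
import re

message = """\
{{#is_alert}}
CPU is critically high on {{host.name}}. @pagerduty-hypothetical-service
{{/is_alert}}
{{#is_warning}}
CPU is elevated on {{host.name}}. @slack-hypothetical-channel
{{/is_warning}}
"""


def unbalanced_blocks(template: str) -> list[str]:
    """Return names of conditional blocks that are unclosed or closed out of order."""
    stack: list[str] = []
    problems: list[str] = []
    for kind, name in re.findall(r"\{\{([#/])(\w+)\}\}", template):
        if kind == "#":
            stack.append(name)
        elif not stack or stack.pop() != name:
            problems.append(name)
    return problems + stack


print(unbalanced_blocks(message))                    # []
print(unbalanced_blocks("{{#is_alert}} no closer"))  # ['is_alert']
```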

Absent notifications

If you suspect that notifications are not being delivered, check the following:

Check email preferences for the recipient and ensure that Notification from monitor alerts is checked.
Check the event stream for events with the string Error delivering notification.

Opsgenie multi-notification

If you use multiple @opsgenie-[...] notifications in your monitor, Datadog sends those notifications to Opsgenie with the same alias. Because Opsgenie treats notifications that share an alias as duplicates, it discards all but one of them.