Monitor-based SLOs

Overview

To build an SLO from new or existing Datadog monitors, create a monitor-based SLO.

Time-based data sets usually map well to monitor-based SLOs. Using a monitor-based SLO, you can calculate the Service Level Indicator (SLI) by dividing the amount of time your system exhibits good behavior by the total time.

monitor-based SLO example

Prerequisites

To create a monitor-based SLO, you need an existing Datadog monitor. To set up a new monitor, go to the monitor creation page.

Datadog monitor-based SLOs support the following monitor types:

  • Metric Monitor Types (Metric, Integration, APM Metric, Anomaly, Forecast, Outlier)
  • Synthetic
  • Service Checks (open beta)

Setup

On the SLO status page, select New SLO.

Under Define the source, select Monitor Based.

Define queries

In the search box, start typing the name of a monitor. A list of matching monitors appears. Click on a monitor name to add it to the source list.

Set your SLO targets

Select a target percentage, time window, and optional warning level.

The target percentage specifies the portion of time the underlying monitor(s) of the SLO should not be in the ALERT state.

The time window specifies the rolling period the SLO runs its calculation.

Depending on the value of the SLI, the Datadog UI displays the SLO status in a different color:

  • While the SLI remains above the target, the UI displays the SLO status in green.
  • When the SLI falls below the target, the UI displays the SLO status in red.
  • If you included a warning level, and the SLI falls below the warning, but above the target level, the UI displays the SLO status in yellow.

The time window you choose changes the available precision for your monitor-based SLOs:

  • 7-day and 30-day time windows allow up to two decimal places.
  • 90-day time windows allow up to three decimal places.

In the details UI for the SLO, Datadog displays two decimal places for SLOs configured with 7-day and 30-day time windows and three decimal places for SLOs configured with 90-day time windows.

The following example demonstrates why Datadog displays a limited number of decimal places for SLO calculations. A 99.999% target for a 7-day or 30-day time window results in an error budget of 6 seconds or 26 seconds, respectively. Monitors evaluate every minute, so the granularity of a monitor-based SLO is also 1 minute. Therefore, one alert would fully consume and overspend the 6 second or 26 second error budget in the previous example. In practice, teams cannot satisfy such small error budgets.

If you need finer granularity than the once a minute monitor evaluation, consider using metric-based SLOs instead.

Add name and tags

Choose a name and extended description for your SLO. Select any tags you would like to associate with your SLO.

Select Save & Exit to save your new SLO.

Status calculation

SLO detail showing 99 percent green with 8 groups aggregated

Datadog calculates the overall status as the percentage of the time where all monitors or all the calculated groups in a single multi-alert monitor are NOT in the ALERT state. Datadog does not calculate the average of the aggregated monitors or the aggregated groups.

Monitor-based SLOs treat the WARN state as OK. The definition of an SLO requires a binary distinction between good and bad behavior. SLO calculations treat WARN as good behavior since WARN is not severe enough to indicate bad behavior.

Consider the following example for a monitor-based SLO containing 3 monitors. The calculation for a monitor-based SLO based on a single multi-alert monitor would look similar.

Monitort1t2t3t4t5t6t7t8t9t10Status
Monitor 1OKOKOKOKALERTOKOKOKOKOK90%
Monitor 2OKOKOKOKOKOKOKOKALERTOK90%
Monitor 3OKOKALERTOKALERTOKOKOKOKOK80%
Overall StatusOKOKALERTOKALERTOKOKOKALERTOK70%

In this example, the overall status is lower than the average of the individual statuses.

Exceptions for synthetic tests

In certain cases, there is an exception to the status calculation for monitor-based SLOs that are comprised of one grouped Synthetic test. Synthetic tests have optional special alerting conditions that change the behavior of when the test enters the ALERT state and consequently impact the overall uptime:

  • Wait until the groups are failing for a specified number of minutes (default: 0)
  • Wait until a specified number of the groups are failing (default: 1)
  • Retry a specified number of times before a location’s test is considered a failure (default: 0)

If you change any of these conditions to something other than their defaults, the overall status for a monitor-based SLO using one Synthetic test could appear better than the aggregated statuses of the Synthetic test’s individual groups.

For more information on Synthetic test alerting conditions, see Synthetic Monitoring.

Impact of manual and automatic monitor updates

When a monitor is resolved manually or as a result of the After x hours automatically resolve this monitor from a triggered state setting, SLO calculations do not change. If these are important tools for your workflow, consider cloning your monitor, removing auto-resolve settings and @-notification settings, and using the clone for your SLO.

Datadog recommends against using monitors with Alert Recovery Threshold and Warning Recovery Threshold to underlie an SLO. These settings make it difficult to cleanly differentiate between an SLI’s good behavior and bad behavior.

Muting a monitor does not affect the SLO calculation.

To exclude time periods from an SLO calculation, use the SLO status corrections feature.

Missing data

When you create a monitor, you choose whether it sends an alert when data is missing. This configuration affects how a monitor-based SLO calculation interprets missing data.

For monitors configured to ignore missing data, time periods with missing data are treated as OK by the SLO. For monitors configured to alert on missing data, time periods with missing data are treated as ALERT by the SLO.

Underlying monitor and SLO histories

SLOs based on the metric monitor types have a feature called SLO Replay that backfills SLO statuses with historical data pulled from the underlying monitors’ metrics and query configurations. When you create a new Metric Monitor and set an SLO on that new monitor, you do not have to wait a full 7, 30, or 90 days to view the SLO status. Instead, SLO Replay triggers when you create the new SLO and looks at the history of the monitor’s underlying metric and query to fill in the status.

SLO Replay also triggers when you change the underlying metric monitor’s query to correct the status based on the new monitor configuration. As a result of SLO Replay recalculating an SLO’s status history, the monitor’s status history and the SLO’s status history may not match after a monitor update.

Note: SLOs based on Synthetic tests or Service Checks do not support SLO Replay.

Other considerations

Confirm you are using the preferred SLI type for your use case. Datadog supports monitor-based SLIs and metric-based SLIs.

Further Reading