To build an SLO from new or existing Datadog monitors, create a monitor-based SLO.
Time-based data sets usually map well to monitor-based SLOs. Using a monitor-based SLO, you can calculate the Service Level Indicator (SLI) by dividing the amount of time your system exhibits good behavior by the total time.
To create a monitor-based SLO, you need an existing Datadog monitor. To set up a new monitor, go to the monitor creation page.
Datadog monitor-based SLOs support the following monitor types:
- Metric Monitor Types (Metric, Integration, APM Metric, Anomaly, Forecast, Outlier)
- Service Checks (open beta)
On the SLO status page, select New SLO.
Under Define the source, select Monitor Based.
In the search box, start typing the name of a monitor. A list of matching monitors appears. Click on a monitor name to add it to the source list.
If you’re only using a single multi alert monitor in an SLO, you can optionally select “Calculate on selected groups” and pick up to 20 groups. Group selection is not supported for SLOs that contain multiple monitors.
Set your SLO targets
Select a target percentage, time window, and optional warning level.
The target percentage specifies the portion of time the underlying monitor(s) of the SLO should not be in the ALERT state.
The time window specifies the rolling period the SLO runs its calculation.
Depending on the value of the SLI, the Datadog UI displays the SLO status in a different color:
- While the SLI remains above the target, the UI displays the SLO status in green.
- When the SLI falls below the target, the UI displays the SLO status in red.
- If you included a warning level, and the SLI falls below the warning, but above the target level, the UI displays the SLO status in yellow.
The time window you choose changes the available precision for your monitor-based SLOs:
- 7-day and 30-day time windows allow up to two decimal places.
- 90-day time windows allow up to three decimal places.
In the details UI for the SLO, Datadog displays two decimal places for SLOs configured with 7-day and 30-day time windows and three decimal places for SLOs configured with 90-day time windows.
The following example demonstrates why Datadog displays a limited number of decimal places for SLO calculations. A 99.999% target for a 7-day or 30-day time window results in an error budget of 6 seconds or 26 seconds, respectively. Monitors evaluate every minute, so the granularity of a monitor-based SLO is also 1 minute. Therefore, one alert would fully consume and overspend the 6 second or 26 second error budget in the previous example. In practice, teams cannot satisfy such small error budgets.
If you need finer granularity than the once a minute monitor evaluation, consider using metric-based SLOs instead.
Choose a name and extended description for your SLO. Select any tags you would like to associate with your SLO.
Select Save & Exit to save your new SLO.
Datadog calculates the overall SLO status as the uptime percentage across all monitors or monitor groups, unless specific groups have been selected. If specific groups have been selected, the SLO status is calculated with only those groups. If no specific groups are selected, the UI displays the five groups with the worst statuses.
Monitor-based SLOs treat the
WARN state as
OK. The definition of an SLO requires a binary distinction between good and bad behavior. SLO calculations treat
WARN as good behavior since
WARN is not severe enough to indicate bad behavior.
Consider the following example for a monitor-based SLO containing 3 monitors. The calculation for a monitor-based SLO based on a single multi alert monitor would look similar.
In this example, the overall status is lower than the average of the individual statuses.
Exceptions for synthetic tests
In certain cases, there is an exception to the status calculation for monitor-based SLOs that are comprised of one grouped Synthetic test. Synthetic tests have optional special alerting conditions that change the behavior of when the test enters the ALERT state and consequently impact the overall uptime:
- Wait until the groups are failing for a specified number of minutes (default: 0)
- Wait until a specified number of the groups are failing (default: 1)
- Retry a specified number of times before a location’s test is considered a failure (default: 0)
If you change any of these conditions to something other than their defaults, the overall status for a monitor-based SLO using one Synthetic test could appear better than the aggregated statuses of the Synthetic test’s individual groups.
For more information on Synthetic test alerting conditions, see Synthetic Monitoring.
Impact of manual and automatic monitor updates
When a monitor is resolved manually or as a result of the After x hours automatically resolve this monitor from a triggered state setting, SLO calculations do not change. If these are important tools for your workflow, consider cloning your monitor, removing auto-resolve settings and
@-notification settings, and using the clone for your SLO.
Datadog recommends against using monitors with
Alert Recovery Threshold and
Warning Recovery Threshold to underlie an SLO. These settings make it difficult to cleanly differentiate between an SLI’s good behavior and bad behavior.
Muting a monitor does not affect the SLO calculation.
To exclude time periods from an SLO calculation, use the SLO status corrections feature.
When you create a metric monitor or service check, you choose whether it sends an alert when data is missing. This configuration affects how a monitor-based SLO calculation interprets missing data. For monitors configured to ignore missing data, time periods with missing data are treated as OK (uptime) by the SLO. For monitors configured to alert on missing data, time periods with missing data are treated as ALERT (downtime) by the SLO.
If you pause a Synthetic test, the SLO removes the time period with missing data from its calculation. In the UI, these time periods are marked light gray on the SLO status bar.
Underlying monitor and SLO histories
SLOs based on the metric monitor types have a feature called SLO Replay that backfills SLO statuses with historical data pulled from the underlying monitors’ metrics and query configurations. When you create a new Metric Monitor and set an SLO on that new monitor, you do not have to wait a full 7, 30, or 90 days to view the SLO status. Instead, SLO Replay triggers when you create the new SLO and looks at the history of the monitor’s underlying metric and query to fill in the status.
SLO Replay also triggers when you change the underlying metric monitor’s query to correct the status based on the new monitor configuration. As a result of SLO Replay recalculating an SLO’s status history, the monitor’s status history and the SLO’s status history may not match after a monitor update.
Note: SLOs based on Synthetic tests or Service Checks do not support SLO Replay.
Confirm you are using the preferred SLI type for your use case. Datadog supports monitor-based SLIs and metric-based SLIs.
Additional helpful documentation, links, and articles: