Burn Rate Alerts

This feature is in open beta. Email slo-help@datadoghq.com to ask questions or to provide feedback on this feature.

Overview

SLO burn rate alerts notify you when your SLO error budget is being consumed faster than a threshold you specify and that rate is sustained for a specified period of time. For example, for an SLO with a 30-day target, you can set an alert to trigger when a burn rate of 14.4 or more is measured over both the past hour and the past 5 minutes. You can also set an optional warning at a slightly lower threshold than the alert, for example when a burn rate of 7.2 or more is observed.

Note: Burn rate alerts are available only for metric-based SLOs and for monitor-based SLOs composed entirely of Metric Monitor types (Metric, Integration, APM Metric, Anomaly, Forecast, or Outlier Monitors).

How burn rate alerts work

A burn rate is a unitless value, a term coined by Google, that indicates how fast your error budget is consumed relative to the length of your SLO’s target. For example, for a 30-day target, a burn rate of 1 means your error budget would be fully consumed in exactly 30 days if that rate were held constant. A burn rate of 2 means the error budget would be exhausted in 15 days, a burn rate of 3 means 10 days, and so on.

This relationship is represented by the following formula:

$$\frac{\text{length of SLO target (7, 30, or 90 days)}}{\text{burn rate}} = \text{time until error budget is fully consumed}$$
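As a rough illustration of this relationship, the following Python sketch computes the time until the error budget is exhausted for a few burn rates. The function name `days_until_budget_exhausted` is made up for this example and is not part of any Datadog library.

```python
def days_until_budget_exhausted(target_days: float, burn_rate: float) -> float:
    """Days until the error budget is fully consumed at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn rate must be positive")
    return target_days / burn_rate


# For a 30-day target, burn rates of 1, 2, and 3 give 30, 15, and 10 days.
for rate in (1, 2, 3):
    print(f"burn rate {rate}: {days_until_budget_exhausted(30, rate)} days")
```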

A burn rate alert uses the recent “error rate” to measure the observed burn rate. Here, “error rate” means the ratio of bad behavior to total behavior during a given period:

$$\text{error rate} = \frac{\text{bad behavior during the period}}{\text{total behavior during the period}}$$

The units of “behavior” will differ depending on the type of SLO. Metric-based SLOs track the number of occurrences of something (like number of successful or failed requests), while monitor-based SLOs track amounts of time (like downtime and uptime of monitors).

When you set a target for your SLO (like 99.9%), your error budget is the amount of unreliability you’re allowed to have:

$$\text{error budget} = 100\% - \text{SLO target}$$

In other words, your error budget (in fractional form) is the ideal error rate you should be maintaining, so a burn rate can also be interpreted as a multiplier of your ideal error rate. For example, for a 99.9% SLO over 30 days, a burn rate of 10 means the error budget is on pace to be completely depleted in 3 days and the observed error rate is 10 times the ideal error rate:

$$\text{observed error rate} = \text{burn rate} \times \text{ideal error rate} = 10 \times 0.1\% = 1\%$$
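To make the multiplier interpretation concrete, here is a minimal Python sketch that derives a burn rate from counts of bad and total behavior. The function names and the request counts are illustrative assumptions, not Datadog APIs or real data.

```python
def error_budget_fraction(target_percent: float) -> float:
    """Error budget in fractional form, for example 99.9 -> roughly 0.001."""
    if not 0 < target_percent < 100:
        raise ValueError("target must be strictly between 0 and 100 percent")
    return (100 - target_percent) / 100


def observed_burn_rate(bad: float, total: float, target_percent: float) -> float:
    """Burn rate = observed error rate / ideal error rate (the error budget)."""
    if total <= 0:
        raise ValueError("total behavior must be positive")
    error_rate = bad / total
    return error_rate / error_budget_fraction(target_percent)


# 100 bad requests out of 10,000 is a 1% error rate, about 10 times the
# 0.1% error budget of a 99.9% SLO.
print(round(observed_burn_rate(bad=100, total=10_000, target_percent=99.9), 3))
```

For a monitor-based SLO, the same function could be fed minutes of downtime and total minutes instead of request counts.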

Ideally, you should always try to maintain a burn rate of 1 over the course of your SLO’s target (as you invest in evolving your application with new features). However, in practice, your burn rate will fluctuate as issues or incidents cause your burn rate to increase rapidly until the issue is resolved. Therefore, alerting on burn rates allows you to be proactively notified when an issue is consuming your error budget at an elevated rate that could potentially cause you to miss your SLO target.

When you configure a burn rate alert, you specify the burn rate threshold along with a “long alerting window” and a “short alerting window” over which the observed burn rate is measured.

The long alerting window is specified in hours. It ensures the monitor measures the burn rate over a period long enough to correspond to a significant issue, which prevents flaky alerts caused by minor issues. The short alerting window is specified in minutes. It ensures the monitor recovers quickly once the actual issue is over by checking whether the recent burn rate is still above the threshold. Google recommends making the short window 1/12 of the long window. However, you can customize the short window programmatically in Datadog through the API or with Terraform.

The alert triggers only when the observed burn rate exceeds the threshold in both windows. Here is the full formula for how the burn rate alert evaluates:

$$\left(\frac{\text{error rate over long window}}{1 - \text{SLO target}} > \text{burn rate threshold}\right) \land \left(\frac{\text{error rate over short window}}{1 - \text{SLO target}} > \text{burn rate threshold}\right)$$
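The two-window evaluation can be sketched in Python as follows. This is a simplified model, not Datadog's implementation: it assumes the error rates for each window are already known, and it uses a strict `>` comparison to mirror the example query in the API section.

```python
def burn_rate_alert_triggers(
    long_error_rate: float,
    short_error_rate: float,
    target_percent: float,
    threshold: float,
) -> bool:
    """True only if the burn rate exceeds the threshold in BOTH windows."""
    budget = (100 - target_percent) / 100  # error budget in fractional form
    long_burn = long_error_rate / budget
    short_burn = short_error_rate / budget
    # Assumption: strict comparison, as in the example query `... > 14.4`.
    return long_burn > threshold and short_burn > threshold


# Ongoing incident: both windows show a 2% error rate (burn rate about 20).
print(burn_rate_alert_triggers(0.02, 0.02, 99.9, 14.4))
# Incident over: the short window has dropped back to a 0.1% error rate.
print(burn_rate_alert_triggers(0.02, 0.001, 99.9, 14.4))
```

The second call shows why the short window helps the monitor recover quickly: the long window still contains the incident, but the recent burn rate is back near 1.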

Maximum burn rate values

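As noted above, you can use this formula to evaluate the observed burn rate for both the long window and the short window:

$$\text{observed burn rate} = \frac{\text{error rate over the window}}{1 - \text{SLO target}}$$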

The maximum error rate that you can ever observe is 1 (for example, when 100% of the total behavior is bad behavior during the given time period). This means that there is a maximum possible burn rate value that you can use in your burn rate alerts:

$$\text{max possible burn rate} = \frac{1}{1 - \text{SLO target}}$$

The lower your SLO target, the lower your maximum possible burn rate value. For example, a 99.9% target allows burn rates up to 1000, while a 96% target allows at most 25. Setting a burn rate threshold higher than this maximum would tell the alert to notify you when your SLO sees an error rate greater than 100%, which is impossible, so the alert could never trigger. To prevent such unhelpful alerts from being created accidentally, Datadog blocks the creation of burn rate alerts whose burn rate value exceeds the maximum.
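A short Python sketch of this bound is shown below. The helper names are hypothetical, and the check that a threshold must be positive is an assumption added for the example.

```python
def max_burn_rate(target_percent: float) -> float:
    """Largest burn rate that can ever be observed for a given SLO target."""
    if not 0 < target_percent < 100:
        raise ValueError("target must be strictly between 0 and 100 percent")
    # Equivalent to 1 / (1 - SLO target) with the target in fractional form.
    return 100 / (100 - target_percent)


def is_valid_threshold(threshold: float, target_percent: float) -> bool:
    """A threshold above the maximum could never trigger an alert."""
    return 0 < threshold <= max_burn_rate(target_percent)


for target in (96, 99, 99.9):
    print(f"{target}% target -> max burn rate {max_burn_rate(target):.1f}")
```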

Picking burn rate values

The burn rate values to alert on depend on the target and time window your SLO uses. When you configure a burn rate alert, focus on setting the burn rate threshold itself and the long window. Datadog recommends initially keeping the short window at 1/12 of the long window, as Google suggests, and adjusting it if needed after you have used the alert. Your maximum possible burn rate is bounded by the relationship described in the previous section.

Approach #1: Time to error budget depletion

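For the burn rate threshold, recall the previous relationship:

$$\frac{\text{length of SLO target (7, 30, or 90 days)}}{\text{burn rate}} = \text{time until error budget is fully consumed}$$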

Solve for burn rate and pick a time until the error budget is fully consumed that would qualify as a significant issue.

For the long window, choose a period of time over which an elevated burn rate must be sustained to indicate a real issue rather than a minor transient one. The higher the burn rate you select, the shorter the long window you should pair it with, so that high-severity issues are caught sooner.
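As an illustration of Approach #1, the sketch below solves the relationship for the burn rate. The function name and the 50-hour depletion time are hypothetical choices for this example.

```python
def burn_rate_for_time_to_depletion(target_days: float, hours_to_depletion: float) -> float:
    """Burn rate at which the error budget would be gone in the given time."""
    if hours_to_depletion <= 0:
        raise ValueError("time to depletion must be positive")
    return target_days * 24 / hours_to_depletion


# If losing the whole 30-day budget within 50 hours counts as a significant
# issue, the matching burn rate threshold is about 14.4.
print(round(burn_rate_for_time_to_depletion(30, 50), 2))
```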

Approach #2: Theoretical error budget consumption

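Alternatively, you can think of a burn rate and long window pairing in terms of theoretical error budget consumption:

$$\text{theoretical error budget consumed} = \frac{\text{burn rate} \times \text{long window (in hours)}}{\text{length of SLO target (in hours)}} \times 100\%$$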

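For example, for a 7-day SLO, to be alerted if the theoretical error budget consumption is 10% with 1 hour as your long window, the selected burn rate should be:

$$\text{burn rate} = \frac{10\% \times 7 \text{ days} \times 24 \text{ hours}}{1 \text{ hour} \times 100\%} = 16.8$$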

Note: For metric-based SLOs, the relationship in Approach #2 extrapolates the total number of occurrences contained in the long window out to the full length of the SLO target. In practice, the observed error budget consumption won’t correspond exactly to this relationship, because the total occurrences tracked by a metric-based SLO in a rolling window will likely vary throughout the day. A burn rate alert is meant to predict significant amounts of error budget consumption before they occur. For monitor-based SLOs, theoretical and actual error budget consumption are equal because time always moves at a constant rate. For example, 60 minutes of monitor data is always contained in the 1-hour window.
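The Approach #2 relationship can be sketched in Python in both directions. The function names are made up for this example, and consumption is expressed as a fraction (0.10 for 10%).

```python
def theoretical_consumption(burn_rate: float, long_window_hours: float, target_days: float) -> float:
    """Fraction of the error budget consumed if the burn rate holds for the long window."""
    return burn_rate * long_window_hours / (target_days * 24)


def burn_rate_for_consumption(consumed_fraction: float, long_window_hours: float, target_days: float) -> float:
    """Burn rate that consumes the given fraction of the budget within the long window."""
    return consumed_fraction * target_days * 24 / long_window_hours


# 7-day SLO: consumption fractions paired with long windows (in hours).
for consumed, window in ((0.10, 1), (0.20, 6), (0.40, 24)):
    rate = burn_rate_for_consumption(consumed, window, target_days=7)
    print(f"{consumed:.0%} in {window}h -> burn rate {round(rate, 2)}")
```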

Monitor creation

  1. Navigate to the SLO status page.
  2. Create a new SLO or edit an existing one, then click the Save and Set Alert button. For existing SLOs, you can also click the Set up Alerts button in the SLO detail side panel to take you directly to the alert configuration.
  3. Select the Burn Rate tab in Step 1: Setting alerting conditions.
  4. Set an alert to trigger when a certain burn rate is measured during a specific long window:
    • The burn rate value must be in the range: 0 < burn rate <= 1 / (1 - SLO target).
    • The long window value is limited to: 1 hour <= long window <= 48 hours.
    • In the UI the short window is automatically calculated as: short window = 1/12 * long window.
    • You can specify a different short window value using the API or Terraform, but it must always be less than the long window.
  5. Add notification information in the Say what’s happening and Notify your team sections.
  6. Click the Save and Exit button on the SLO configuration page.
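The constraints in step 4 can be checked before creating a monitor with a small helper like the one below. This is an illustrative sketch only: the function names are hypothetical, and the requirement that the burn rate be positive is an assumption.

```python
from typing import List, Optional


def default_short_window_minutes(long_window_hours: float) -> float:
    """Short window as 1/12 of the long window, as the UI computes it."""
    return long_window_hours * 60 / 12


def validate_burn_rate_alert(
    burn_rate: float,
    long_window_hours: float,
    target_percent: float,
    short_window_minutes: Optional[float] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the settings look valid."""
    if not 0 < target_percent < 100:
        raise ValueError("target must be strictly between 0 and 100 percent")
    if short_window_minutes is None:
        short_window_minutes = default_short_window_minutes(long_window_hours)

    problems = []
    if not 1 <= long_window_hours <= 48:
        problems.append("long window must be between 1 and 48 hours")
    if short_window_minutes >= long_window_hours * 60:
        problems.append("short window must be less than the long window")
    max_rate = 100 / (100 - target_percent)
    if not 0 < burn_rate <= max_rate:
        problems.append(f"burn rate must be positive and at most {max_rate:g}")
    return problems


print(validate_burn_rate_alert(14.4, 1, 99.9))  # no problems expected
print(validate_burn_rate_alert(30, 1, 96))      # exceeds the maximum of 25
```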

Examples

Below are tables of Datadog’s recommended values for 7, 30, and 90-day targets.

  • These examples assume a 99.9% target, but they are reasonable for targets as low as 96% (the max burn rate for 96% is 25). If you use lower targets, you may need lower thresholds, as described in the Maximum burn rate values section. In that case, Datadog recommends using Approach #2 with either a smaller value for theoretical error budget consumed or a higher value for the long window.
  • For metric-based SLOs, the theoretical error budget consumed is calculated by extrapolating the number of total occurrences observed in the long alerting window out to the total length of the SLO target.

For 7-day targets:

| Burn rate | Long window | Short window | Theoretical error budget consumed |
|---|---|---|---|
| 16.8 | 1 hour | 5 minutes | 10% |
| 5.6 | 6 hours | 30 minutes | 20% |
| 2.8 | 24 hours | 120 minutes | 40% |

For 30-day targets:

| Burn rate | Long window | Short window | Theoretical error budget consumed |
|---|---|---|---|
| 14.4 | 1 hour | 5 minutes | 2% |
| 6 | 6 hours | 30 minutes | 5% |
| 3 | 24 hours | 120 minutes | 10% |

For 90-day targets:

| Burn rate | Long window | Short window | Theoretical error budget consumed |
|---|---|---|---|
| 21.6 | 1 hour | 5 minutes | 1% |
| 10.8 | 6 hours | 30 minutes | 3% |
| 4.5 | 24 hours | 120 minutes | 5% |

Recommendation: If you find that your burn rate alert is consistently too flaky, this is an indication that you should make your short window slightly larger. However, note that the larger you make your short window, the slower the monitor will be in recovering after an issue has ended.

API and Terraform

You can create SLO burn rate alerts using the create-monitor API endpoint. Below is an example query for a burn rate alert that triggers when a burn rate above 14.4 is measured over both the past hour and the past 5 minutes. Replace slo_id with the alphanumeric ID of the SLO you want to alert on, and replace time_window with 7d, 30d, or 90d, depending on the target your SLO is configured with:

burn_rate("slo_id").over("time_window").long_window("1h").short_window("5m") > 14.4
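If you generate these queries programmatically, a small Python helper like the following can assemble the query string and a request body. This is a sketch under stated assumptions: the SLO ID `abc123` is a placeholder, the body fields mirror the Terraform example in this section, and placing the critical threshold under `options.thresholds` is an assumption to check against the create-monitor API reference. Nothing is sent to the API.

```python
import json


def burn_rate_query(slo_id: str, time_window: str, long_window: str,
                    short_window: str, threshold: float) -> str:
    """Build a burn rate alert query in the format shown above."""
    if time_window not in ("7d", "30d", "90d"):
        raise ValueError("time_window must be one of 7d, 30d, or 90d")
    return (
        f'burn_rate("{slo_id}").over("{time_window}")'
        f'.long_window("{long_window}").short_window("{short_window}")'
        f" > {threshold}"
    )


query = burn_rate_query("abc123", "30d", "1h", "5m", 14.4)  # placeholder SLO ID

# Assumed request body for the create-monitor endpoint (not sent here).
payload = {
    "name": "SLO Burn Rate Alert Example",
    "type": "slo alert",
    "query": query,
    "message": "Example monitor message",
    "tags": ["foo:bar", "baz"],
    "options": {"thresholds": {"critical": 14.4}},
}
print(json.dumps(payload, indent=2))
```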

SLO burn rate alerts can also be created using the datadog_monitor resource in Terraform. Below is an example .tf for configuring a burn rate alert for a metric-based SLO using the same example query as above.

Note: SLO burn rate alerts are only supported in Terraform provider v2.7.0 or earlier and in provider v2.13.0 or later. Versions between v2.7.0 and v2.13.0 are not supported.

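resource "datadog_monitor" "metric-based-slo" {
    name = "SLO Burn Rate Alert Example"
    type = "slo alert"

    # Replace slo_id and time_window as described above.
    # The <<- heredoc form strips the leading indentation from the query.
    query = <<-EOT
    burn_rate("slo_id").over("time_window").long_window("1h").short_window("5m") > 14.4
    EOT

    message = "Example monitor message"

    # Critical threshold, matching the value used in the query.
    monitor_thresholds {
      critical = 14.4
    }
    tags = ["foo:bar", "baz"]
}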

Beta restrictions

  • Alerting is available only for metric-based SLOs and for monitor-based SLOs composed entirely of Metric Monitor types (Metric, Integration, APM Metric, Anomaly, Forecast, or Outlier Monitors).
  • The alert status of an SLO monitor is available in the Alerts tab in the SLO’s detail panel or the Manage Monitors page.