How to Set Up Incident Data for DORA Metrics
DORA Metrics is in Preview.
Overview
Failed deployment events, currently derived from incident events, are used to compute change failure rate and mean time to restore (MTTR).
Selecting and configuring an incident data source
PagerDuty is an incident management platform that equips IT teams with immediate incident visibility, enabling proactive and effective responses to maintain operational stability and resilience.
To integrate your PagerDuty account with DORA Metrics:
Navigate to Integrations > Developer Tools in PagerDuty and click Generic Webhooks (v3).
Click + New Webhook and enter the following details:
| Variable | Description |
|---|---|
| Webhook URL | Add `https://webhook-intake.<DD_SITE>/api/v2/webhook/`. |
| Scope Type | Select Account to send incidents for all PagerDuty services in your account. Alternatively, you can send incidents for specific services or teams by selecting a different scope type. |
| Description | A description helps distinguish the webhook. Add something like `Datadog DORA Metrics integration`. |
| Event Subscription | Select the following events:<br>- `incident.acknowledged`<br>- `incident.annotated`<br>- `incident.custom_field_values.updated`<br>- `incident.delegated`<br>- `incident.escalated`<br>- `incident.priority_updated`<br>- `incident.reassigned`<br>- `incident.reopened`<br>- `incident.resolved`<br>- `incident.triggered`<br>- `incident.unacknowledged` |
| Custom Headers | Click Add custom header, enter `DD-API-KEY` as the name, and enter your Datadog API key as the value.<br><br>Optionally, you can add an environment to all of the PagerDuty incidents sent from the webhook by creating an additional custom header with the name `dd_env` and the desired environment as the value. |
To save the webhook, click Add Webhook.
The severity of the incident in the DORA Metrics product is based on the incident priority in PagerDuty.
Note: Upon webhook creation, a new secret is created and used to sign all the webhook payloads. That secret is not needed for the integration to work, as the authentication is performed using the API key instead.
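Since the webhook authenticates with your API key rather than the signing secret, it can help to confirm the key is valid before pasting it into the custom header. A minimal sketch using Datadog's API key validation endpoint (replace `<DD_SITE>` with your Datadog site):

```shell
# Returns {"valid":true} when the API key is accepted.
curl -s "https://api.<DD_SITE>/api/v1/validate" \
  -H "DD-API-KEY: ${DD_API_KEY}"
```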
When an incident event is received for a specific PagerDuty service, Datadog attempts to retrieve the related Datadog service and team from any triggering Datadog monitors and from the Service Catalog.
The matching algorithm works in the following steps:
1. If the PagerDuty incident event was triggered from a Datadog monitor:
   - If the monitor is in Multi Alert mode, the incident metrics and events are emitted with the `env`, `service`, and `team` from the alerted group.
   - If the monitor has tags for `env`, `service`, or `team`:
     - `env`: If the monitor has a single `env` tag, the incident metrics and events are emitted with that environment.
     - `service`: If the monitor has one or more `service` tags, the incident metrics and events are emitted with the provided services.
     - `team`: If the monitor has a single `team` tag, the incident metrics and events are emitted with that team.
2. If the service URL of the incident matches the PagerDuty service URL for any services in the Service Catalog:
   - If a single Datadog service matches, the incident metrics and events are emitted with the service and team.
   - If multiple Datadog services match, the incident metrics and events are emitted with the team.

   For more information about setting the PagerDuty service URL for a Datadog service, see Use Integrations with Service Catalog, or see the sketch after this list.
3. If the PagerDuty service name of the incident matches a service name in the Service Catalog, the incident metrics and events are emitted with the service and team.
4. If the PagerDuty team name of the incident matches a team name in the Service Catalog, the incident metrics and events are emitted with the team.
5. If the PagerDuty service name of the incident matches a team name in the Service Catalog, the incident metrics and events are emitted with the team.
6. If there have been no matches up to this point, the incident metrics and events are emitted with the PagerDuty service and PagerDuty team provided in the incident.
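For step 2 to match, the Datadog service needs its PagerDuty service URL registered in the Service Catalog. A hypothetical sketch using the Service Definition API, where the service, team, and PagerDuty URL are placeholders and the field names follow the v2.1 service definition schema:

```shell
curl -X POST "https://api.<DD_SITE>/api/v2/services/definitions" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -d @- << EOF
{
  "schema-version": "v2.1",
  "dd-service": "shopist",
  "team": "shopist-devs",
  "integrations": {
    "pagerduty": {
      "service-url": "https://your-org.pagerduty.com/service-directory/PXXXXXX"
    }
  }
}
EOF
```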
To send your own incident events, use the DORA Metrics API. Incident events are used to compute change failure rate and mean time to restore.

Include the `finished_at` attribute in an incident event to mark that the incident is resolved. You can send events at the start of the incident and after incident resolution. Incident events are matched by the `env`, `service`, and `started_at` attributes.
The following attributes are required:

- `services` or `team` (at least one must be present)
- `started_at`
You can optionally add the following attributes to the incident events:

- `finished_at` for resolved incidents. This attribute is required for calculating the time to restore service.
- `id` for identifying incidents when they are created and resolved. This attribute is user-generated; when not provided, the endpoint returns a Datadog-generated UUID.
- `name` to describe the incident.
- `severity`
- `env` to filter your DORA metrics by environment on the DORA Metrics page.
- `repository_url`
- `commit_sha`
See the DORA Metrics API reference documentation for the full spec and additional code samples.
API (cURL) Example
For the following configuration, replace `<DD_SITE>` with your Datadog site:
```shell
curl -X POST "https://api.<DD_SITE>/api/v2/dora/incident" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -d @- << EOF
{
  "data": {
    "attributes": {
      "services": ["shopist"],
      "team": "shopist-devs",
      "started_at": 1693491974000000000,
      "finished_at": 1693491984000000000,
      "git": {
        "commit_sha": "66adc9350f2cc9b250b69abddab733dd55e1a588",
        "repository_url": "https://github.com/organization/example-repository"
      },
      "env": "prod",
      "name": "Web server is down failing all requests",
      "severity": "High"
    }
  }
}
EOF
```
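Because incident events are matched by `env`, `service`, and `started_at`, you can also split the report into two calls: one when the incident opens and one when it resolves. A minimal sketch reusing the same placeholder values:

```shell
# 1. At incident start: omit finished_at, so the incident stays open.
curl -X POST "https://api.<DD_SITE>/api/v2/dora/incident" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -d '{"data":{"attributes":{"services":["shopist"],"env":"prod","started_at":1693491974000000000}}}'

# 2. After resolution: send the same env, service, and started_at,
#    plus finished_at to mark the incident as resolved.
curl -X POST "https://api.<DD_SITE>/api/v2/dora/incident" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -d '{"data":{"attributes":{"services":["shopist"],"env":"prod","started_at":1693491974000000000,"finished_at":1693491984000000000}}}'
```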
Calculating change failure rate
Change failure rate requires both deployment data and incident data.
Change failure rate is calculated as the percentage of incident events out of the total number of deployments. Datadog divides `dora.incidents.count` by `dora.deployments.count` for the same services and/or teams associated with both a failure event and a deployment event.
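As a worked example with made-up numbers, a service with 2 incident events and 40 deployments over the selected window has a change failure rate of 5%:

```text
change failure rate = dora.incidents.count / dora.deployments.count
                    = 2 / 40
                    = 5%
```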
Calculating time to restore
Time to restore is calculated as the duration distribution for resolved incident events.
DORA Metrics generates the `dora.time_to_restore` metric by recording the start and end times of each incident event. It calculates the mean time to restore (MTTR) as the average of these `dora.time_to_restore` data points over a selected time frame.
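As a worked example, the timestamps from the cURL sample above yield a ten-second time to restore; MTTR is then the average of such durations across all resolved incidents in the selected time frame:

```text
time_to_restore = finished_at - started_at
                = 1693491984000000000 ns - 1693491974000000000 ns
                = 10 s
```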