Set Up Failure Data for DORA Metrics

Overview

Failure events are used to compute change failure rate and time to restore.

Selecting and configuring a failure data source

Datadog Incidents

DORA Metrics can automatically identify and track failures through Datadog Incidents. After incidents are declared, DORA Metrics uses them to measure change failure rate and time to restore.

Note: The time to restore is measured as the total duration an incident spends in the active state. For cases like active → stable → active → stable, it includes all active periods. The time to restore is shown only when an incident is in a stable or resolved state. If a resolved incident is reactivated, the metric is hidden until it’s resolved again.
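The rule in the note above can be sketched in a few lines of Python. This is an illustration only, not Datadog's implementation: the `time_to_restore` function, the `(timestamp, state)` input shape, and the lowercase state names are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Optional


def time_to_restore(transitions: list[tuple[datetime, str]]) -> Optional[timedelta]:
    """Sum the time an incident spent in the "active" state.

    `transitions` is a chronological list of (timestamp, state) pairs, where
    state is "active", "stable", or "resolved" (hypothetical input format).
    Returns None while the incident is currently active, mirroring the note
    that the metric is hidden until the incident is stable or resolved.
    """
    if not transitions or transitions[-1][1] == "active":
        return None

    total = timedelta()
    # Pair each transition with the next one to get the time spent in each state.
    for (start, state), (end, _next_state) in zip(transitions, transitions[1:]):
        if state == "active":
            total += end - start
    return total
```

For an incident that goes active → stable → active → stable, both active periods are added together; if it is later reactivated, the function returns `None` until a new stable or resolved transition is recorded.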

Requirements

Incidents is enabled as a Failures event data source in DORA settings.

To avoid having unlabeled failures, Datadog strongly recommends adding the following attributes to incidents:

Teams
Services
Envs: The Envs attribute can be added in the Incident Settings if it doesn’t already exist.

If an incident has a severity, it is added to the failure event as the Severity tag.

Recommended: In the Incident Settings, set the Prompted field of each attribute to At Resolution so that responders are always prompted to add these attributes when resolving an incident.

Include historical incidents

You can retroactively include incidents from the past two years by selecting Backfill Data in the DORA settings, which creates failures from those incidents. Backfilling data can take up to an hour to complete.

PagerDuty

PagerDuty is an incident management platform that equips IT teams with immediate incident visibility, enabling proactive and effective responses to maintain operational stability and resilience.

To integrate your PagerDuty account with DORA Metrics:

1. Enable PagerDuty as a Failures event data source in DORA settings.
2. Navigate to Integrations > Developer Tools in PagerDuty and click Generic Webhooks (v3).
3. Click + New Webhook and enter the following details:

Variable	Description
Webhook URL	Add `https://webhook-intake.<DD_SITE>/api/v2/webhook/`, replacing `<DD_SITE>` with your Datadog site.
Scope Type	Select the scope of incidents to send. You can send incidents for a specific Service or Team, or for all PagerDuty services in your Account. Depending on your environment and access level, some scope types may not be available.
Description	A description helps distinguish the webhook. Add something like `Datadog DORA Metrics integration`.
Event Subscription	Select the following events: `incident.acknowledged`, `incident.annotated`, `incident.custom_field_values.updated`, `incident.delegated`, `incident.escalated`, `incident.priority_updated`, `incident.reassigned`, `incident.reopened`, `incident.resolved`, `incident.triggered`, `incident.unacknowledged`
Custom Headers	Click Add custom header, enter `DD-API-KEY` as the name, and input your Datadog API key as the value. Optionally, you can add an environment to all of the PagerDuty incidents sent from the webhook by creating an additional custom header with the name `dd_env` and the desired environment as the value.

To save the webhook, click Add Webhook.

The severity of the failure in the DORA Metrics product is based on the incident priority in PagerDuty.

Note: Upon webhook creation, a new secret is created and used to sign all the webhook payloads. That secret is not needed for the integration to work, as the authentication is performed using the API key instead.

Mapping PagerDuty services to Datadog services

When an incident event is received for a specific PagerDuty service, Datadog attempts to retrieve the related Datadog service and team from any triggering Datadog monitors and from the Software Catalog.

The matching algorithm works in the following steps:

1. If the PagerDuty incident event was triggered from a Datadog monitor:
   - If the monitor is in Multi Alert mode, the incident metrics and events are emitted with the env, service, and team from the alerted group.
   - If the monitor has tags for env, service, or team:
     - env: If the monitor has a single env tag, the incident metrics and events are emitted with the environment.
     - service: If the monitor has one or more service tags, the incident metrics and events are emitted with the provided services.
     - team: If the monitor has a single team tag, the incident metrics and events are emitted with the team.
2. If the service URL of the incident matches the PagerDuty service URL for any services in the Software Catalog:
   - If a single Datadog service matches, the incident metrics and events are emitted with the service and team.
   - If multiple Datadog services match, the incident metrics and events are emitted with the team.
   For more information about setting the PagerDuty service URL for a Datadog service, see Use Integrations with Software Catalog.
3. If the PagerDuty service name of the incident matches a service name in the Software Catalog, the incident metrics and events are emitted with the service and team.
4. If the PagerDuty team name of the incident matches a team name in the Software Catalog, the incident metrics and events are emitted with the team.
5. If the PagerDuty service name of the incident matches a team name in the Software Catalog, the incident metrics and events are emitted with the team.
6. If there have been no matches up to this point, the incident metrics and events are emitted with the PagerDuty service and PagerDuty team provided in the incident.

If an incident is resolved manually in PagerDuty instead of from a monitor notification, the incident resolution event does not contain monitor information and the first step of the matching algorithm is skipped.
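To make the order of the matching steps easier to follow, here is a rough Python sketch of the fallback chain. It is not Datadog's code: the `Incident` and `CatalogService` types, the `resolve_tags` function, the shape of the `monitor` dictionary, and the first-match-wins behavior are all assumptions made for illustration. In particular, the source does not say which team is used when several services match by URL; the sketch simply takes the first one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatalogService:  # hypothetical view of a Software Catalog entry
    name: str
    team: Optional[str] = None
    pagerduty_url: Optional[str] = None


@dataclass
class Incident:  # hypothetical view of a PagerDuty incident event
    pd_service_name: str
    pd_service_url: str
    pd_team_name: Optional[str] = None
    # Assumed shape: {"multi_alert_group": {...}} or {"tags": {"env": [...], ...}}
    monitor: Optional[dict] = None


def resolve_tags(incident: Incident, services: list[CatalogService], teams: set[str]) -> dict:
    # Step 1: tags from the triggering Datadog monitor, if any.
    mon = incident.monitor
    if mon:
        if mon.get("multi_alert_group"):
            return dict(mon["multi_alert_group"])
        tags, out = mon.get("tags", {}), {}
        if len(tags.get("env", [])) == 1:
            out["env"] = tags["env"][0]
        if tags.get("service"):
            out["services"] = list(tags["service"])
        if len(tags.get("team", [])) == 1:
            out["team"] = tags["team"][0]
        if out:
            return out

    # Step 2: PagerDuty service URL registered on Software Catalog services.
    by_url = [s for s in services if s.pagerduty_url == incident.pd_service_url]
    if len(by_url) == 1:
        return {"services": [by_url[0].name], "team": by_url[0].team}
    if len(by_url) > 1:
        return {"team": by_url[0].team}  # assumption: use the first match's team

    # Step 3: PagerDuty service name equals a catalog service name.
    by_name = next((s for s in services if s.name == incident.pd_service_name), None)
    if by_name:
        return {"services": [by_name.name], "team": by_name.team}

    # Steps 4 and 5: PagerDuty team name, then service name, equals a catalog team name.
    if incident.pd_team_name in teams:
        return {"team": incident.pd_team_name}
    if incident.pd_service_name in teams:
        return {"team": incident.pd_service_name}

    # Step 6: fall back to the values provided by PagerDuty.
    return {"services": [incident.pd_service_name], "team": incident.pd_team_name}
```

When an incident is resolved manually in PagerDuty, `monitor` would be `None` in this model, so the lookup starts at the URL match.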

API

To send your own failure events, use the DORA Metrics API. Failure events are used to calculate change failure rate and time to restore.

Include the finished_at attribute in a failure event to mark the failure as resolved. You can send one event at the start of the failure and another after it has been resolved. Failure events are matched by the env, service, and started_at attributes.

Requirements

datadog-ci CLI / API is enabled as a Failures event data source in DORA settings.
The following attributes are required:
- services or team (at least one must be present)
- started_at

You can optionally add the following attributes to the failure events:

- finished_at for resolved failures. Required for calculating time to restore.
- id for identifying failures. This attribute is user-generated; when not provided, the endpoint returns a Datadog-generated UUID.
- name to describe the failure.
- severity
- env to filter your DORA metrics by environment on the DORA Metrics page.
- repository_url
- commit_sha
- version
- custom_tags: Tags in the form key:value that can be used to filter events on the DORA Metrics page.

See the DORA Metrics API reference documentation for the full spec and additional code samples.

API (cURL) Example

In the following example, replace <DD_SITE> with your Datadog site (for example, datadoghq.com):

curl -X POST "https://api.<DD_SITE>/api/v2/dora/failure" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -d @- << EOF
  {
    "data": {
      "attributes": {
        "services": ["shopist"],
        "team": "shopist-devs",
        "started_at": 1693491974000000000,
        "finished_at": 1693491984000000000,
        "git": {
          "commit_sha": "66adc9350f2cc9b250b69abddab733dd55e1a588",
          "repository_url": "https://github.com/organization/example-repository"
        },
        "env": "prod",
        "name": "Web server is down failing all requests",
        "severity": "High",
        "version": "v1.12.07",
        "custom_tags": ["department:engineering", "app_type:backend"]
      }
    }
  }
EOF
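Because failure events are matched by env, service, and started_at, the start and resolution of one failure can be reported as two events that share those values. The Python sketch below only builds the two request bodies; it does not send them. The `build_failure_event` helper and its parameter names are hypothetical, and the timestamps are assumed to be Unix epoch nanoseconds, judging by the values in the cURL example.

```python
import json
from typing import Optional, Sequence


def build_failure_event(
    started_at_ns: int,
    services: Optional[Sequence[str]] = None,
    team: Optional[str] = None,
    finished_at_ns: Optional[int] = None,
    env: Optional[str] = None,
    name: Optional[str] = None,
    severity: Optional[str] = None,
) -> dict:
    """Build a request body shaped like the cURL example (hypothetical helper)."""
    # The requirements list says at least one of services or team must be present.
    if not services and not team:
        raise ValueError("at least one of services or team is required")

    optional = {
        "services": list(services) if services else None,
        "team": team,
        "finished_at": finished_at_ns,
        "env": env,
        "name": name,
        "severity": severity,
    }
    attributes = {"started_at": started_at_ns}
    # Leave out attributes that were not provided.
    attributes.update({k: v for k, v in optional.items() if v is not None})
    return {"data": {"attributes": attributes}}


started = 1693491974000000000
# Event sent when the failure starts: no finished_at yet.
opened = build_failure_event(started, services=["shopist"], env="prod")
# Event sent after resolution: same env, service, and started_at, plus finished_at.
resolved = build_failure_event(
    started, services=["shopist"], env="prod", finished_at_ns=started + 10 * 10**9
)
print(json.dumps(resolved, indent=2))
```

Either body could then be posted with the same headers as the cURL example.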

Calculating change failure rate

Change failure rate requires both deployment data and failure data.

Change failure rate is calculated as the percentage of failure events out of the total number of deployments. Datadog divides the count of failures by the count of deployments for the services and/or teams associated with both a failure event and a deployment event.
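As a worked illustration of the ratio, the following Python sketch computes the rate for a single service. The `change_failure_rate` function and the `service` key on each event are assumptions made for the example, not part of the Datadog API.

```python
from typing import Optional


def change_failure_rate(
    failures: list[dict], deployments: list[dict], service: str
) -> Optional[float]:
    """Percentage of failures over deployments for one service.

    Each event is a dict with a "service" key (hypothetical shape).
    Returns None when the service has no deployments, since the ratio
    is undefined in that case.
    """
    failure_count = sum(1 for f in failures if f["service"] == service)
    deployment_count = sum(1 for d in deployments if d["service"] == service)
    if deployment_count == 0:
        return None
    return 100.0 * failure_count / deployment_count
```

For example, 2 failures against 8 deployments of the same service gives a change failure rate of 25%.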

Calculating time to restore

Time to restore is calculated as the duration distribution for resolved failure events.

DORA Metrics generates the Time to Restore metric by recording the start and end times of each failure event. It calculates the time to restore as the median of these Time to Restore data points over a selected time frame.
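The median step can be sketched as follows. This is a minimal illustration, assuming each failure is a dict with `started_at` and an optional `finished_at` in the same time unit; the `median_time_to_restore` name is hypothetical, and selecting failures within the chosen time frame is left to the caller.

```python
from statistics import median
from typing import Optional


def median_time_to_restore(failures: list[dict]) -> Optional[float]:
    """Median duration of resolved failures, in the unit of the timestamps.

    Failures without a finished_at value are unresolved and are skipped.
    Returns None when there are no resolved failures.
    """
    durations = [
        f["finished_at"] - f["started_at"]
        for f in failures
        if f.get("finished_at") is not None
    ]
    return median(durations) if durations else None
```

Using the median rather than the mean keeps a single very long outage from dominating the reported value.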

Custom tags

If the services associated with the failure are registered in the Software Catalog with metadata set up (see Adding Metadata), the languages of the services and any tags are automatically retrieved and associated with the failure event.