Incident Management Analytics

Docs > Incident Response > Incident Management > Incident Management Analytics

Overview

Use Incident Analytics to learn from past incidents and understand the efficiency and performance of your incident response process. Incident analytics allows you to pull aggregated statistics on your incidents over time. You can use these statistics to create reports that help you to:

Analyze whether your incident response process is improving over time
Assess your mean time to resolution
Identify areas of improvement that you should invest in

Data collected

Incident Management Analytics is a queryable data source for aggregated incident statistics. You can query these analytics in a variety of graph widgets in both Dashboards and Notebooks to analyze the history of your incident response over time. To give you a starting point, Datadog provides the following out-of-the-box resources that you can clone and customize:

Incident timestamps

Incidents carry three timestamp attributes that influence analytics:

Declaration time (declared): When the incident was declared.
Detection time (detected): When the underlying resource from which the incident was declared was created. For example, if a monitor alert fires at 2 p.m. and the incident is declared at 2:30 p.m., the detected time is 2 p.m. If the incident wasn’t declared from another Datadog resource, detected is the same as declared.
Resolution time (resolved): When the incident was most recently resolved.

Measures

Incident Management reports the following analytic measures, which you can use to power analytic queries in Dashboard and Notebook widgets:

Customer Impact Duration: The duration during which customers were impacted, based on the impacts defined on the incident.
Status Active Duration: The duration that the incident was in an “active” state, based on the incident timeline.
Status Stable Duration: The duration that the incident was in a “stable” state, based on the incident timeline.
Time to Detect: The duration from the earliest customer impact to the incident’s detection time.
Time to Repair: The duration from the incident’s detection time to the last customer impact.
Time to Resolve: The duration from the incident’s declaration time to the time it was resolved.

In addition to these defaults, you can create new measures by adding custom Number property fields in your Incident Settings.

Timestamp overrides

Incident responders may forget to declare a Datadog incident before starting the response process. They may also forget to resolve an incident in Datadog even after the incident response process effectively ends. These oversights may paint a misleading picture in your incident analytics, permanently inflating your mean time to resolve or other measures.

To address this, organizations can enable timestamp overrides, which allow incident responders to manually override an incident’s recorded timestamps. When present, overrides affect the following analytic measures:

Time to Detect
Time to Repair
Time to Resolve

Overrides only influence search and analytics. They do not change the history automatically recorded to the incident timeline. They do not apply to the analytic measures Status Active Duration or Status Stable Duration, which are driven by the cumulative length of status segments on incident timelines.

To enable timestamp overrides, go to Incidents > Settings > Information.

To create, update, or delete timestamp overrides, users must have the Incidents Write permission.

Visualize incident data in dashboards

To configure your graph using Incident Management Analytics data, follow these steps:

Select your visualization.
Select Incidents from the data source dropdown menu.
Select a measure from the yellow dropdown menu.
- Default Statistic: Counts the number of incidents.
Select an aggregation for the measure.
(Optional) Select a rollup for the measure.
(Optional) Use the search bar to filter the statistic down to a specific subset of incidents.
(Optional) Select a facet in the pink dropdown menu to break the measure up by group and select a limited number of groups to display.
Title the graph.
Save your widget.

Example: Weekly outage customer impact duration grouped by service

Timeseries graph configuration showing Incidents data source filtered by severity, showing the customer impact duration grouped by service

This example configuration shows you an aggregation of your incidents that are SEV-1 or SEV-2. The graph displays the Customer Impact Duration of those incidents grouped by service.

Widget: Timeseries Line Graph
Datasource: Incidents
Measure: Customer Impact Duration
Aggregation: avg
Rollup: 1w
Filter: severity:("SEV-1" OR "SEV-2")
Group: Services, limit to top 5

Incident report

Use the out-of-the-box Notebook template to create the Incident Report or build one from scratch to get a summary report of incidents in your team or service.

Open the Incident Report template.
Click Use Template to edit and customize.
You can use the existing Incident cells or customize the query to display values for each measure.
Update the summary cells with the relevant values and share the report with the rest of your team.