Incident Management

Datadog Incident Management helps your team members identify, mitigate, and analyze disruptions and threats to your organization’s services. With Incident Management, you can design an automation-enhanced response process that helps your teams assemble around a shared framework and toolkit. You can also use incident analytics to evaluate the effectiveness of your incident response process.

Incidents live in Datadog alongside your metrics, traces, and logs. Your teams can declare incidents from monitor alerts, security signals, events, cases, and more. You can also configure monitors to declare incidents automatically.

Get Started

Incident Management requires no installation. Get started by taking a Learning Center course, reading our guided walkthrough, or declaring an incident.


Billing

Incident Management is a seat-based SKU. To learn more about how Incident Management is billed and how to manage seats within Datadog, visit our pricing page and the Incident Response billing documentation.

View and search for incidents

To view your incidents, go to the Incidents page, which shows a feed of all ongoing incidents. From there, you can filter incidents using the properties listed on the left and export your search results. To configure additional fields that appear on all incidents, go to Incident Settings.

Search examples

Incident search uses the same event-based search syntax as Logs and Event Management. Combine key:value pairs with the Boolean operators AND and OR, and prefix a term with - to exclude matching incidents.

Query | Description
--- | ---
severity:SEV-1 | Show all SEV-1 incidents
severity:(SEV-1 OR SEV-2) state:active | Show all active SEV-1 or SEV-2 incidents
services:checkout AND -state:resolved | Show unresolved incidents affecting the checkout service
teams:platform | Show incidents assigned to the platform team
services:web* | Show incidents affecting services starting with "web"
Root\ Cause\ Category:Bug | Show incidents with a specific root cause attribute
responder:john.smith@datadoghq.com | Show incidents where John Smith is a responder
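
As a rough illustration of how these pieces combine, the following Python sketch assembles query strings from key:value filters. The helper names (escape_key, term, any_of, all_of) are hypothetical and not part of any Datadog library. The sketch only reproduces the syntax shown in the search examples above, and it always writes AND explicitly even though the second example omits it.

```python
"""Sketch: compose Incidents search queries from key:value filters.

All helper names are hypothetical; only the emitted syntax (key:value
pairs, AND/OR, a leading '-' for exclusion, backslash-escaped spaces in
attribute names) comes from the search examples.
"""


def escape_key(key: str) -> str:
    # Attribute names containing spaces are written with backslash-escaped
    # spaces, as in the Root\ Cause\ Category example.
    return key.replace(" ", "\\ ")


def term(key: str, value: str, exclude: bool = False) -> str:
    # A single key:value filter; a leading '-' excludes matching incidents.
    prefix = "-" if exclude else ""
    return f"{prefix}{escape_key(key)}:{value}"


def any_of(key: str, values: list[str]) -> str:
    # Several values for one key, grouped with OR inside parentheses.
    return f"{escape_key(key)}:({' OR '.join(values)})"


def all_of(*terms: str) -> str:
    # Join filters with an explicit AND.
    return " AND ".join(terms)


if __name__ == "__main__":
    print(all_of(term("services", "checkout"), term("state", "resolved", exclude=True)))
    print(all_of(any_of("severity", ["SEV-1", "SEV-2"]), term("state", "active")))
    print(term("Root Cause Category", "Bug"))
```

Building queries from small pieces like this keeps escaping in one place, which matters most for custom attributes whose names contain spaces.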

Filter and export

  • Filter by properties: Use the facet panel on the left to filter by Status, Severity, Time To Repair (hours), and other configured properties.
  • Export search results: Use the Export button at the top of the incident list.
  • Save views: Save your frequently used search queries and filters for quick access.

Mobile access

You can also download the Datadog Mobile App, available on the Apple App Store and Google Play Store, to view your Incidents list from your mobile device's home screen and to create and manage incidents.

Two views in the Datadog Mobile App: one showing an incidents list with high-level details about each incident, and one showing a detailed panel for a single incident

Describing the incident

When declaring an incident, provide a comprehensive description of what happened, why it occurred, and any related attributes, so that all stakeholders in the incident management process are fully informed. The essential elements of an incident declaration include a title, a severity level, and incident commanders. Effective incident documentation involves:

  • Updating incident details, including its status, impact, root cause, detection methods, and service impacts.
  • Forming and managing a response team, using custom responder roles, and leveraging metadata attributes for detailed incident assessment.
  • Configuring notifications to keep all stakeholders informed throughout the incident resolution process.

For more information, see the Describe an Incident documentation.
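
To make the essential elements concrete, here is a minimal Python sketch of a local pre-flight check you might run before declaring an incident from your own tooling. The IncidentDeclaration class, its field names, and the SEV-1 to SEV-5 scale are assumptions for illustration only; they are not Datadog's incident schema or API.

```python
from dataclasses import dataclass, field

# Hypothetical severity scale, matching the SEV-n labels used in the
# search examples; your organization's configured levels may differ.
SEVERITIES = ("SEV-1", "SEV-2", "SEV-3", "SEV-4", "SEV-5")


@dataclass
class IncidentDeclaration:
    """Hypothetical local model of a declaration (not a Datadog API type)."""

    title: str
    severity: str
    commander: str
    what_happened: str = ""
    services: list[str] = field(default_factory=list)
    status: str = "active"
    root_cause: str = ""

    def missing_essentials(self) -> list[str]:
        # Names of essential elements that are empty or not on the scale.
        problems = []
        if not self.title.strip():
            problems.append("title")
        if self.severity not in SEVERITIES:
            problems.append("severity")
        if not self.commander.strip():
            problems.append("commander")
        return problems


# Usage: a declaration with no commander fails the check.
draft = IncidentDeclaration(title="Checkout errors", severity="SEV-2", commander="")
print(draft.missing_essentials())  # → ['commander']
```

Richer fields such as root cause and affected services default to empty here because they are often filled in as the investigation progresses rather than at declaration time.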

Evaluate incident data

Incident Analytics provides insights into the efficiency and performance of your incident response process by allowing you to aggregate and analyze statistics from past incidents. Key metrics, such as time to resolution and customer impact, can be tracked over time. You can query these analytics using graph widgets in dashboards and notebooks. Datadog offers customizable templates, such as the Incident Management Overview Dashboard and a Notebook Incident Report, to help you get started.

For more details on the measures collected and step-by-step graph configurations to visualize your data, see Incident Management Analytics.
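
Incident Analytics computes measures like time to resolution for you in graph widgets. Purely to show the underlying arithmetic, the following Python sketch groups hypothetical resolved incidents by severity and reports the count, mean, and median hours to resolution. The record layout and sample timestamps are invented for illustration and do not reflect a Datadog export format.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median

# Hypothetical records: (severity, declared_at, resolved_at) in ISO format.
INCIDENTS = [
    ("SEV-1", "2024-05-01T10:00", "2024-05-01T12:30"),
    ("SEV-1", "2024-05-08T09:00", "2024-05-08T10:00"),
    ("SEV-2", "2024-05-03T14:00", "2024-05-04T02:00"),
]


def hours_to_resolve(declared: str, resolved: str) -> float:
    # Duration between declaration and resolution, in hours.
    delta = datetime.fromisoformat(resolved) - datetime.fromisoformat(declared)
    return delta.total_seconds() / 3600


def resolution_stats(incidents):
    # Group resolution times by severity, then summarize each group.
    by_severity = defaultdict(list)
    for severity, declared, resolved in incidents:
        by_severity[severity].append(hours_to_resolve(declared, resolved))
    return {
        sev: {"count": len(hours), "mean_h": mean(hours), "median_h": median(hours)}
        for sev, hours in sorted(by_severity.items())
    }


print(resolution_stats(INCIDENTS))
```

Reporting the median alongside the mean is a deliberate choice: a single long-running incident can pull the mean far above what a typical incident looks like.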

Integrations

Incident Management integrates closely with other Datadog products.

Third-party integrations

Incident Management integrates with third-party applications, including:

  • Atlassian Statuspage to create and update Statuspage incidents.
  • Confluence to generate incident postmortems.
  • CoScreen to launch collaborative meetings with multi-user screen sharing, remote control, and built-in audio and video chat.
  • CoTerm to follow terminal-based incident remediation activities in real time.
  • Jira to create a Jira ticket for an incident.
  • Microsoft Teams to create channels and video meetings for incidents.
  • PagerDuty and OpsGenie to page your on-call engineers and auto-resolve pages upon incident resolution.
  • ServiceNow to create ServiceNow tickets for incidents.
  • Slack to create channels for incidents.
  • Webhooks to send incident notifications to other services (for example, sending SMS through Twilio).
  • Zoom to launch video calls for incidents.
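
For the webhooks integration, the receiving end is your own service. The following Python sketch, using only the standard library, accepts JSON POST requests and turns them into a short text message. It assumes you configured the webhook to send a JSON body with title, severity, and status keys; those keys, the 160-character cap, and the helper names are hypothetical, and forwarding the text to an SMS provider such as Twilio is left out.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_SMS_CHARS = 160  # assumption: keep each notification to one short text


def format_sms(payload: dict) -> str:
    # Build a short text from a hypothetical payload with
    # "title", "severity", and "status" keys, using fallbacks if absent.
    text = "[{}] {} ({})".format(
        payload.get("severity", "SEV-?"),
        payload.get("title", "untitled incident"),
        payload.get("status", "unknown"),
    )
    return text[:MAX_SMS_CHARS]


class IncidentWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except ValueError:  # malformed JSON or undecodable bytes
            payload = None
        if not isinstance(payload, dict):
            self.send_response(400)
            self.end_headers()
            return
        # Hand-off to an SMS provider is out of scope; print instead.
        print(format_sms(payload))
        self.send_response(204)
        self.end_headers()


def run(port: int = 8080) -> None:
    # Blocks and serves requests until interrupted.
    HTTPServer(("", port), IncidentWebhookHandler).serve_forever()
```

Call run() to listen on port 8080. Keeping format_sms separate from the HTTP handler lets you test the message formatting without starting a server.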
