Getting Started with Incident Management

Incident Management is not available on the Datadog for Government site.

Overview

Datadog Incident Management is for tracking and communicating about an issue you’ve identified with your metrics, traces, or logs.

This guide walks you through using the Datadog app for declaring an incident, updating the incident as investigation and remediation progresses, and generating a postmortem when the incident has been resolved. The example assumes the Slack integration is enabled.

Walking through an incident from issue detection to resolution

Declaring an incident

Scenario: A monitor is alerting on a high number of errors which may be slowing down several services. It’s unclear whether customers are being impacted.

This guide describes using the Datadog Clipboard to declare an incident.

  1. Open the Clipboard: Ctrl/Cmd + Shift + K.

    Using the Clipboard, you can gather information from different sources, such as graphs, monitors, entire dashboards, or notebooks. This helps you provide as much information as possible when declaring an incident.

    For this guide, choose a graph from the System - Metrics dashboard to copy to the Clipboard.

  2. In the Datadog menu on the left-hand side, go to Dashboard > Dashboard lists and select System - Metrics.

  3. Hover over one of the graphs and copy it to the Clipboard:

    a. Ctrl/Cmd + C

    or

    b. Click the Export icon on the graph and select Copy.

  4. In the Datadog menu on the left-hand side, go to Monitors > Manage Monitors and select [Auto] Clock in sync with NTP.

  5. Click Add current page to add the monitor to the Clipboard.

  6. Click Select All and then Add Selected Items To…

  7. Select New Incident.

  8. Describe what’s happening:

    Severity: Set to Unknown, since it’s unclear whether customers are being impacted and how related services are affected. See the in-app description of what each severity level means and follow your team’s guidelines.
    Title: Follow any naming conventions your team uses for incident titles. Because this is not a real incident, include the word TEST to make it clear that this is a test incident. An example title: [TEST] My incident test
    Signals: Signals are the reason you are declaring the incident. These can be graphs, logs, or other key visuals. The graph and the monitor you selected are already included, but you can add more signals. For example, copy the URL for this guide and add it using Ctrl/Cmd + V.
    Incident Commander: Leave this assigned to you. In an actual incident, this would be assigned to the leader of the incident investigation. You or others can change the incident commander as the investigation progresses.
    Additional Notifications: Leave blank because this is only a test, and you don’t want to alert anyone else or another service. In an actual incident, add the people and services that should be notified to help with the investigation and remediation. You can send these notifications to Slack and PagerDuty as well.

  9. Click Declare Incident to create the incident.

    You can also declare an incident from a graph, a monitor, or the incidents API. APM users can click the Siren icon on any APM graph to declare an incident.

    As part of the Slack integration, you can also use the /datadog incident shortcut to declare an incident and set the title, severity, and customer impact.

    After the incident has been created, you can add additional notifications by clicking the Notify button in the top right corner.

  10. Click Open Slack Channel on the top left of the incident’s page to go to the incident’s Slack channel.

    A new Slack channel dedicated to the incident is automatically created for any new incident, so that you can consolidate communication with your team and begin troubleshooting. If your organization’s Slack integration is set up to update a global incident channel, then the channel is updated with the new incident.

    In this example, you are the only one added to the new incident channel. When you add additional people or services in Additional Notifications for an actual incident, everyone will be automatically added to the incident channel.

    If you don’t have the Slack integration enabled, click Link to Chat to add the link to the chat service you are using to discuss the incident.

    You can also use Link Video Call to add a link to the call where discussions about the incident are happening.
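
The steps above declare the incident from the Clipboard, but the same fields can also be sent through the incidents API. The following is a minimal Python sketch of building such a request body for the test incident. The endpoint URL, the payload shape, and the UNKNOWN severity value are assumptions based on the public v2 API and should be checked against the API reference for your Datadog site. build_incident_payload is a hypothetical helper, and the sketch does not send anything over the network.

```python
import json

# Assumed endpoint for the incidents API; verify it for your Datadog site.
INCIDENTS_URL = "https://api.datadoghq.com/api/v2/incidents"


def build_incident_payload(title, severity="UNKNOWN", customer_impacted=False):
    """Build a request body for declaring an incident (hypothetical helper)."""
    return {
        "data": {
            "type": "incidents",
            "attributes": {
                "title": title,
                "customer_impacted": customer_impacted,
                # Assumed field shape for the severity dropdown.
                "fields": {"severity": {"type": "dropdown", "value": severity}},
            },
        }
    }


# Mirror the values used in this guide: a test title and unknown severity.
payload = build_incident_payload("[TEST] My incident test")
body = json.dumps(payload, indent=2)
print(body)
```

To actually declare the incident, you would POST body to INCIDENTS_URL with your API key and application key in the DD-API-KEY and DD-APPLICATION-KEY headers. Keeping the payload construction separate from the request makes it easy to inspect what would be sent before alerting anyone.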

Troubleshooting and updating the incident

The Incident page has four main sections: Overview, Timeline, Remediation, and Communications. Update these sections as the incident progresses to keep everyone informed of the current status.

Overview

Scenario: After some investigation, you discover that the root cause is a host running out of memory. You’ve also been informed that a small subset of customers are being affected and seeing slow loading of pages. The first customer report came in 15 minutes ago. It is a SEV-3 incident.

In the Overview section, you can update incident fields and customer impact as the investigation continues.

To update the severity level and root cause:

  1. Click the Overview tab.

  2. Click Edit in the Properties box.

  3. Click the Severity dropdown and select SEV-3.

  4. Add to the Root Cause field: TEST: Host is running out of memory.

  5. Select Monitor in the Detection dropdown, because you were first alerted by a monitor on the issue.

  6. Click Save to update the properties.

    From Slack, you can also update the title, severity, or status of an ongoing issue using the /datadog incident update command.

To update the customer impact:

  1. Click Edit in the Impact box.

  2. Select Yes in the Customer impact dropdown.

  3. Change the timestamp to 15 minutes earlier, because that was when the first customer report came in.

  4. Add to Scope of impact: TEST: Some customers seeing pages loading slowly.

  5. Click Save to update the fields.

    The top of the incident page shows how long the customer impact has been going on. All changes made on the Overview page are added to the Timeline.
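
As a small illustration of the timestamp arithmetic in the scenario, the Python sketch below backdates the start of customer impact by 15 minutes and formats the resulting duration. The function names and the fixed example time are hypothetical and only serve to make the result reproducible; Datadog performs this calculation for you on the incident page.

```python
from datetime import datetime, timedelta, timezone


def impact_start(now, minutes_ago):
    """Return when customer impact began, given how long ago it was reported."""
    return now - timedelta(minutes=minutes_ago)


def format_duration(delta):
    """Format a timedelta as hours and zero-padded minutes."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


# Fixed example time (an assumption) so the output is reproducible.
now = datetime(2024, 1, 15, 14, 30, tzinfo=timezone.utc)
start = impact_start(now, 15)

print(start.isoformat())             # 2024-01-15T14:15:00+00:00
print(format_duration(now - start))  # 0h 15m
```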

Timeline

The Timeline shows additions and changes to incident fields and information in chronological order.

  1. Click the Timeline tab.

    The Content Type, Important, and Responder filters allow you to show specific types of events.

  2. Find the Customer impact updated event and mark it as Important by clicking the flag icon.

    You can mark any event as Important so that when you generate a postmortem after the incident has been resolved, you can choose to include only timeline events that are marked as Important.

  3. Add a note to the timeline: I found the host causing the issue.

  4. Hover over the note’s event and click the pencil icon to change the note’s timestamp, because you actually found the host causing the issue 10 minutes ago.

  5. Mark the note as Important.

  6. Click Open Slack Channel to go back to the incident’s Slack channel.

  7. Post a message in the channel saying I am working on a fix.

  8. Click the message’s actions command icon (three dots on the right after hovering over a message).

  9. Select Add to Incident to send the message to the timeline.

    You can add any Slack comment in the incident channel to the timeline so that you can easily consolidate important communications related to the investigation and mitigation of the incident.
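
To make the effect of backdating a note and flagging events concrete, here is a minimal Python sketch of a timeline. TimelineEvent and both helper functions are a hypothetical model written for this guide, not Datadog’s data model; the event texts mirror the steps above, and the timestamps are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TimelineEvent:
    """Hypothetical stand-in for one timeline entry."""
    timestamp: datetime
    text: str
    important: bool = False


def chronological(events):
    """Return events ordered by timestamp, oldest first."""
    return sorted(events, key=lambda e: e.timestamp)


def important_only(events):
    """Return only the flagged events, in chronological order."""
    return [e for e in chronological(events) if e.important]


now = datetime(2024, 1, 15, 14, 30)  # made-up example time
events = [
    TimelineEvent(now - timedelta(minutes=5), "Customer impact updated", important=True),
    TimelineEvent(now, "I found the host causing the issue", important=True),
    TimelineEvent(now + timedelta(minutes=1), "I am working on a fix"),
]

# Backdate the note by 10 minutes, as in the pencil-icon step.
events[1].timestamp -= timedelta(minutes=10)

for event in chronological(events):
    flag = "*" if event.important else " "
    print(f"{flag} {event.timestamp:%H:%M} {event.text}")
```

After backdating, the note sorts ahead of the customer impact event, and the unflagged Slack message is the only entry that important_only leaves out.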

Remediation

Scenario: There’s a notebook on how to handle this kind of issue, which includes tasks that need to be done to fix it.

In the Remediation section, you can keep track of documents and tasks for investigating the issue or for post-incident remediation tasks.

  1. Click the Remediation tab.

  2. Click the plus icon (+) in the Documents box and add a link to a Datadog notebook.

    All additions and updates to the Documents section are added to the timeline as an Incident Update type.

  3. Add a task by adding a description of a task in the Incident Tasks box, for example: Run the steps in the notebook.

  4. Click Create Task.

  5. Click Assign To and assign yourself the task.

  6. Click Set Due Date and set the date for today.

    All task additions and changes are recorded in the Timeline.

    You can also add post-incident tasks in the Remediation section to keep track of them.

Communications

Scenario: The issue has been mitigated, and the team is monitoring the situation. The incident status is now stable.

In the Communications section, you can send out a notification updating the status of the incident.

  1. Navigate back to the Overview section.

  2. Click Edit in the Properties box and change the status to stable.

  3. Click Save.

  4. Go to the Communications tab.

  5. Click New Communication.

    The default message has the incident’s title in the subject and information about the current status of the incident in the body.

    In an actual incident you would send updates to the people involved in the incident. For this example, you will send a notification to yourself only.

  6. Add yourself to Add recipients.

  7. Click Send.

    You should receive an email with the message.

    You can create customized templates by clicking on Manage Templates > New Template. Group templates together using the Category field.

Resolution and postmortem

Scenario: It’s been confirmed that the issue no longer impacts customers and that you’ve resolved the issue. The team wants a postmortem to look back on what went wrong.

  1. Go to the Overview section.

  2. Click Edit in the Impact box to update the customer impact.

  3. Toggle the Active switch so that it’s no longer active.

    You can also change the date and time for when the customer impact ended if it occurred earlier.

  4. Click Edit in the Properties box to update the status of the incident.

  5. Change the status to resolved.

  6. Click Save.

    When an incident’s status is set to resolved, a Generate Postmortem button appears at the top.

  7. Click Generate Postmortem.

  8. For the timeline section, select Marked as Important so that only the Important events are added to the postmortem.

  9. Click Generate.

    The postmortem is generated as a Datadog Notebook and includes the timeline events and resources referenced during the investigation and remediation. This makes it easier to review and further document what caused the issue and how to prevent it in the future. Datadog Notebooks support live collaboration, so you can edit the postmortem with your teammates in real time.

    If there are follow-up tasks that you and your team need to complete to ensure the issue doesn’t happen again, add and track them in the Incident Tasks box of the Remediation section.

Customizing your incident management workflow

Datadog Incident Management can be customized with different severity and status levels, based on your organization’s needs, and can include additional information such as the APM services and teams related to the incident. For more information, see the Incident Management documentation.

You can also set up notification rules to automatically notify specific people or services based on an incident’s severity level. For more information, see the Notification Rules documentation.
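
Conceptually, a notification rule maps incident severities to the people or services to notify. The Python sketch below models that idea only; the rule structure, the recipients_for function, and the handles are all hypothetical and do not reflect how Datadog stores or evaluates rules, which you configure in the incident settings.

```python
# Hypothetical rules: each maps a set of severities to handles to notify.
RULES = [
    {"severities": {"SEV-1", "SEV-2"}, "notify": ["@pagerduty-oncall", "@slack-incidents"]},
    {"severities": {"SEV-3"}, "notify": ["@slack-incidents"]},
]


def recipients_for(severity, rules):
    """Collect handles from every rule matching the severity, without duplicates."""
    result = []
    for rule in rules:
        if severity in rule["severities"]:
            for handle in rule["notify"]:
                if handle not in result:
                    result.append(handle)
    return result


print(recipients_for("SEV-3", RULES))    # ['@slack-incidents']
print(recipients_for("UNKNOWN", RULES))  # []
```

In this model, the test incident from this guide would notify no one while its severity is Unknown and would reach the Slack handle once it is updated to SEV-3.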

To customize Incident Management, go to the incident settings page. From the Datadog menu on the left-hand side, go to Monitors > Incidents (if you get an Incident Management welcome screen, click Get Started). Then on the top right corner, click Settings.
