Investigate issues


Start a Bits AI SRE investigation

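You can start a Bits AI SRE investigation manually, from a monitor alert or an APM latency view, or have Bits start one automatically when a monitor alerts.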

Manually start an investigation

Monitor alerts

You can invoke Bits on an individual monitor alert or warn event from several entry points:

Option 1: Bits AI SRE Monitors list

  1. Go to Bits AI SRE > Monitors > Supported.
  2. Click Investigate Recent Alerts and select an alert.

Option 2: Monitor status page

Navigate to the monitor status page of a Bits AI SRE-supported monitor and click Investigate with Bits AI SRE in the top-right corner.

Option 3: Monitor event side panel

In the monitor event side panel of a Bits AI SRE-supported monitor, click Investigate with Bits AI SRE.

Option 4: Slack

To use the Slack integration, connect your Slack workspace to Bits AI SRE.

In Slack, reply to a monitor notification with @Datadog Investigate this alert.

APM latency

Bits AI SRE investigations from APM latency graphs and APM Watchdog stories are in Preview.

APM latency graphs on service pages

  1. In Datadog, navigate to APM and open the service or resource page you want to investigate. Next to the latency graph, click Investigate.
  2. On the point plot visualization, click and drag to draw a rectangular selection around a region that shows unusual latency. This selection seeds the analysis. Initial diagnostics on the latency issue appear, including the observed user impact, anomalous tags contributing to the issue, and recent changes. For more information, see APM Investigator.
  3. Click Investigate with Bits AI SRE to run a deeper investigation.

APM latency Watchdog stories

On a Watchdog APM latency story, click Investigate with Bits AI SRE.

Enable automatic investigations

In addition to manual investigations, you can configure Bits to run automatically when a monitor transitions to the alert state:

From the Bits AI SRE Monitors list

  1. Go to Bits AI SRE > Monitors > Supported.
  2. Toggle Auto-Investigate on for a single monitor. To bulk-edit, select multiple monitors, then click Auto-Investigate All.

For a single monitor

  1. Open the monitor’s status page and click Edit.
  2. Scroll to Configure notifications & automations and toggle Investigate with Bits AI SRE.

  • Enabling automatic investigations using the Datadog API or Terraform is not supported.
  • An investigation starts when a monitor transitions to the alert state.
  • Transitions to the warn or no data state, renotifications, and test notifications do not trigger automatic investigations.
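
The trigger rules above can be summarized as a small predicate. This is an illustrative sketch only: the MonitorEvent class and its field names are hypothetical and do not correspond to Datadog API fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorEvent:
    """Hypothetical event model for illustration; not a Datadog API object."""
    new_state: str                     # "alert", "warn", "no data", or "ok"
    is_state_transition: bool = True   # False for a renotification
    is_test: bool = False              # True for a test notification
    auto_investigate: bool = True      # the monitor's Auto-Investigate toggle


def triggers_auto_investigation(event: MonitorEvent) -> bool:
    """Return True only for a real transition into the alert state."""
    if not event.auto_investigate:
        return False
    if event.is_test or not event.is_state_transition:
        return False
    return event.new_state == "alert"
```

Under these assumptions, MonitorEvent("alert") qualifies, while MonitorEvent("warn") and a renotification of an existing alert do not.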

Supported monitors

Bits can run investigations on the following monitor types:

  • Metric
  • Anomaly
  • Forecast
  • Integration
  • Outlier
  • Logs
  • APM (APM Metrics type only; Trace Analytics is not supported)
  • Synthetics (API tests only)
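
If you keep your own inventory of monitors, the list above could be encoded as a lookup. The sketch below assumes the display names from this page as labels; they are not the type strings used by the Datadog API, and is_supported is a hypothetical helper.

```python
# Display names taken from the list above. They are labels for this sketch,
# not the type strings used by the Datadog API.
FULLY_SUPPORTED = {"metric", "anomaly", "forecast", "integration", "outlier", "logs"}

# Types that are supported for a single subtype only.
PARTIALLY_SUPPORTED = {
    "apm": "apm metrics",      # Trace Analytics is not supported
    "synthetics": "api test",  # only API tests are listed as supported
}


def is_supported(monitor_type: str, subtype: str = "") -> bool:
    """Return True if the (type, subtype) pair appears in the supported list."""
    kind = monitor_type.strip().lower()
    if kind in FULLY_SUPPORTED:
        return True
    return PARTIALLY_SUPPORTED.get(kind) == subtype.strip().lower()
```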

Best practices: Add investigation context to your monitors

Think of onboarding Bits as you would a new teammate: the more context you provide, the better it can investigate.

  • Include Datadog telemetry links: Add at least one helpful telemetry link in the monitor message. Think about the first place you’d normally look in Datadog when this monitor triggers. It could be a link to any of the following:

    • Datadog dashboard
    • Logs
    • Traces
    • Datadog notebook with helpful widgets
    • Confluence runbook page containing Datadog telemetry links (requires a configured Confluence integration)

    Bits uses these links during the Runbook steps of the initial investigation to identify potential problem areas. Because you define the links, you control what Bits reviews: it focuses on the same data you would, and you can tailor investigations to your team’s workflows. No special formatting is required; plain links work.

  • Add service scoping: For monitors associated with a service, add a service tag to the monitor, or filter or group the monitor query by service.

    [Figure: Example monitor with optimization steps applied]
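
As a rough way to audit monitors against these two practices, the sketch below checks a monitor definition for a link in its message and for service scoping. It assumes the monitor is a dict with message, tags, and query keys, treats any http(s) URL as a telemetry link, and uses a simple pattern to detect a service filter or group-by. The dashboard URL and query are made-up examples, and investigation_context_gaps is a hypothetical helper, not part of any Datadog client.

```python
import re

# Assumption: any http(s) URL in the message counts as a telemetry link.
URL_PATTERN = re.compile(r"https?://\S+")
# Assumption: the query is service-scoped if it filters on "service:" or
# lists service inside a "by {...}" group-by clause.
SERVICE_IN_QUERY = re.compile(r"\bservice:|by\s*\{[^}]*\bservice\b[^}]*\}")


def investigation_context_gaps(monitor: dict) -> list:
    """Return a list of missing best practices for one monitor definition."""
    gaps = []
    if not URL_PATTERN.search(monitor.get("message", "")):
        gaps.append("no telemetry link in message")
    has_service_tag = any(t.startswith("service:") for t in monitor.get("tags", []))
    if not has_service_tag and not SERVICE_IN_QUERY.search(monitor.get("query", "")):
        gaps.append("no service scoping")
    return gaps


# Illustrative monitor definition; the dashboard ID and query are placeholders.
monitor = {
    "message": "Checkout latency is high. Dashboard: "
               "https://app.datadoghq.com/dashboard/abc-123-xyz",
    "tags": ["service:checkout"],
    "query": "avg(last_5m):avg:trace.http.request.duration{env:prod} > 2",
}
print(investigation_context_gaps(monitor))
```

An empty list means both practices are in place; each string in the result names one gap to fix.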

For additional suggestions on how to optimize investigations, see Help Bits learn.

How Bits AI SRE investigates

Investigations happen in two phases:

  1. Bits begins by gathering initial context about the problem, along with any information that might help it troubleshoot further. Depending on where the investigation started, you may see one or more of the following step types:

    • Runbook: If the starting point is a monitor alert, Bits begins by parsing Datadog or Confluence links that you have added to the monitor’s message, and uses them as entry points into the investigation.
    • Memory: If you have previously interacted with an investigation for the same monitor, Bits recalls any memories associated with the monitor to inform and accelerate the current investigation.
    • General search: Bits automatically scans your Datadog environment to gather additional context about what’s happening around the alert.
    • Trace Analysis: If the starting point is an APM latency graph, Bits automatically inspects anomalous traces to identify the specific services or tags contributing to latency hotspots.
    [Figure: Flowchart showing Bits AI SRE combining runbook, memory, and general search into initial findings]
  2. Using the collected context, Bits builds multiple root cause hypotheses and tests them concurrently. Bits looks at the following data sources:

    • Metrics
    • Traces
    • Logs
    • Dashboards
    • Change events
    • Kubernetes events

    Each hypothesis ends in one of three states: validated, invalidated, or inconclusive. When a hypothesis is validated, Bits generates sub-hypotheses and repeats the same investigative process on them.
    [Figure: Flowchart showing the hypotheses Bits AI SRE built and tested]
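
The second phase can be pictured as a tree of hypotheses in which only validated nodes are expanded. The sketch below is a conceptual model, not Bits' implementation: the Hypothesis class, the investigate function, and the test and refine callbacks are all hypothetical, the sample statements and verdicts are invented, and hypotheses are evaluated one after another here rather than concurrently.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class State(Enum):
    VALIDATED = "validated"
    INVALIDATED = "invalidated"
    INCONCLUSIVE = "inconclusive"


@dataclass
class Hypothesis:
    statement: str
    state: State = State.INCONCLUSIVE
    children: List["Hypothesis"] = field(default_factory=list)


def investigate(
    hypothesis: Hypothesis,
    test: Callable[[str], State],
    refine: Callable[[str], List[str]],
    max_depth: int = 3,
) -> Hypothesis:
    """Test one hypothesis; expand sub-hypotheses only when it is validated."""
    hypothesis.state = test(hypothesis.statement)
    if hypothesis.state is State.VALIDATED and max_depth > 0:
        for statement in refine(hypothesis.statement):
            child = Hypothesis(statement)
            hypothesis.children.append(child)
            investigate(child, test, refine, max_depth - 1)
    return hypothesis


# Stub verdicts and refinements standing in for real telemetry checks.
verdicts = {
    "recent deploy caused latency": State.VALIDATED,
    "new query lacks an index": State.VALIDATED,
    "database is saturated": State.INVALIDATED,
}
sub_hypotheses = {"recent deploy caused latency": ["new query lacks an index"]}

roots = [
    investigate(
        Hypothesis(statement),
        test=lambda s: verdicts.get(s, State.INCONCLUSIVE),
        refine=lambda s: sub_hypotheses.get(s, []),
    )
    for statement in (
        "recent deploy caused latency",
        "database is saturated",
        "upstream dependency is slow",
    )
]
```

In this model, invalidated and inconclusive hypotheses are leaves, which mirrors the rule that only a validated hypothesis produces sub-hypotheses.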

Reports

The Reports tab lets you track the number of investigations run over time by monitor, user, service, and team. You can also track the mean time to initial findings and the mean time to conclusion to assess the impact of Bits on your on-call efficiency.
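
To make the two timing metrics concrete, the sketch below computes them from a handful of investigation records. The record layout, field names, and timestamps are hypothetical sample data; the Reports tab computes these values for you, and this page does not describe an export format.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical sample records; field names and values are illustrative only.
investigations = [
    {
        "monitor": "checkout latency",
        "started": datetime(2025, 1, 6, 9, 0),
        "initial_findings": datetime(2025, 1, 6, 9, 2),
        "conclusion": datetime(2025, 1, 6, 9, 10),
    },
    {
        "monitor": "checkout latency",
        "started": datetime(2025, 1, 7, 14, 0),
        "initial_findings": datetime(2025, 1, 7, 14, 4),
        "conclusion": datetime(2025, 1, 7, 14, 20),
    },
]


def mean_minutes(records, end_key: str) -> float:
    """Mean elapsed minutes from the start of each investigation to end_key."""
    return mean(
        (r[end_key] - r["started"]).total_seconds() / 60 for r in records
    )


count_by_monitor = Counter(r["monitor"] for r in investigations)
time_to_findings = mean_minutes(investigations, "initial_findings")
time_to_conclusion = mean_minutes(investigations, "conclusion")
```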