---
title: Agent Observability MCP and Skills
description: >-
  Connect AI agents to your Agent Observability traces and experiments using the
  Datadog MCP Server.
breadcrumbs: Docs > Agent Observability > Agent Observability MCP and Skills
---

# Agent Observability MCP and Skills

{% callout %}
# Important note for users on the following Datadog sites: app.ddog-gov.com, us2.ddog-gov.com

{% alert level="danger" %}
This product is not supported for your selected [Datadog site](https://docs.datadoghq.com/getting_started/site.md) ({% placeholder "user-datadog-site-name" /%}).
{% /alert %}

{% /callout %}

## Overview{% #overview %}

The [Datadog MCP Server](https://docs.datadoghq.com/mcp_server/setup.md) enables AI agents to access your [Agent Observability](https://docs.datadoghq.com/llm_observability.md) data through the Model Context Protocol (MCP). The `llmobs` toolset provides tools for searching and analyzing traces, inspecting span details and content, and evaluating experiment results directly from AI-powered clients like Cursor, Claude Code, or OpenAI Codex.

## Setup{% #setup %}

Connect an MCP-compatible client to the Datadog MCP Server with the `llmobs` toolset enabled.

{% alert level="info" %}
For full setup instructions, including Cursor and VS Code extension configuration, see [Set up the Datadog MCP Server](https://docs.datadoghq.com/mcp_server/setup.md).
{% /alert %}

### Prerequisites{% #prerequisites %}

- A Datadog account with permission to access Agent Observability data.
- An MCP-compatible client (for example, Claude Code, Codex CLI, Cursor, Gemini CLI, or Kiro CLI).

### Endpoint{% #endpoint %}

The MCP Server endpoint depends on your [Datadog site](https://docs.datadoghq.com/getting_started/site.md). Use the Datadog Site selector to display the endpoint for your site. Append `?toolsets=llmobs,core` to enable the Agent Observability and core toolsets.

{% callout %}
# Important note for users on the following Datadog sites: app.datadoghq.com, us3.datadoghq.com, us5.datadoghq.com, app.datadoghq.eu, ap1.datadoghq.com, ap2.datadoghq.com, uk1.datadoghq.com



Endpoint for your selected site ({% placeholder "user-datadog-site-name" /%}):

```
<YOUR_MCP_SERVER_ENDPOINT>?toolsets=llmobs,core
```


{% /callout %}

{% callout %}
# Important note for users on the following Datadog sites: app.ddog-gov.com, us2.ddog-gov.com



{% alert level="danger" %}
This product is not supported for your selected site ({% placeholder "user-datadog-site-name" /%}).
{% /alert %}


{% /callout %}
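
As a minimal sketch of the query-string step described above, the following Python snippet appends the `toolsets` parameter to an endpoint. The function name `with_toolsets` and the base URL are hypothetical and used only for illustration; substitute the endpoint shown for your site.

```python
from urllib.parse import urlencode


def with_toolsets(base_endpoint: str, toolsets: list[str]) -> str:
    """Append a comma-separated `toolsets` query parameter to an MCP endpoint."""
    # safe="," keeps the comma literal, matching the `llmobs,core` form.
    query = urlencode({"toolsets": ",".join(toolsets)}, safe=",")
    return f"{base_endpoint}?{query}"


# Hypothetical base URL for illustration; use the endpoint shown for your site.
print(with_toolsets("https://mcp.example.invalid/mcp", ["llmobs", "core"]))
```

The sketch assumes the base endpoint has no existing query string.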

### Connect{% #connect %}

Choose remote authentication when possible. Use local binary authentication if your environment blocks the remote OAuth flow.

{% tab title="Remote authentication" %}

{% callout %}
# Important note for users on the following Datadog sites: app.datadoghq.com, us3.datadoghq.com, us5.datadoghq.com, app.datadoghq.eu, ap1.datadoghq.com, ap2.datadoghq.com, uk1.datadoghq.com



Remote authentication uses the MCP specification's [Streamable HTTP](https://modelcontextprotocol.io/specification/2025-03-26/basic/transports#streamable-http) transport.

**Claude Code** (command line):

```
claude mcp add --transport http datadog-mcp "<YOUR_MCP_SERVER_ENDPOINT>?toolsets=llmobs,core"
```

**Codex CLI** (`~/.codex/config.toml`):

```
[mcp_servers.datadog]
url = "<YOUR_MCP_SERVER_ENDPOINT>?toolsets=llmobs,core"
```

After adding the configuration, run `codex mcp login datadog` to complete the OAuth flow.

**Gemini CLI, Kiro CLI, and other MCP-compatible clients**:

```
{
  "mcpServers": {
    "datadog": {
      "type": "http",
      "url": "<YOUR_MCP_SERVER_ENDPOINT>?toolsets=llmobs,core"
    }
  }
}
```


{% /callout %}

{% callout %}
# Important note for users on the following Datadog sites: app.ddog-gov.com, us2.ddog-gov.com



{% alert level="danger" %}
This product is not supported for your selected site ({% placeholder "user-datadog-site-name" /%}).
{% /alert %}


{% /callout %}

{% /tab %}

{% tab title="Local binary authentication" %}
Local binary authentication uses the MCP specification's [stdio](https://modelcontextprotocol.io/specification/2025-03-26/basic/transports#stdio) transport. Use this method if remote authentication is unavailable.

1. Install the Datadog MCP Server binary:

   ```bash
   curl -sSL https://coterm.datadoghq.com/mcp-cli/install.sh | bash
   ```

   The binary installs to `~/.local/bin/datadog_mcp_cli`.

1. Complete the OAuth login flow:

   ```bash
   datadog_mcp_cli login
   ```

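1. Configure your AI client. For Claude Code, add the following to `~/.claude.json`, replacing `<USERNAME>` in the command path with your username. The example path uses the macOS home directory layout (`/Users/<USERNAME>`); on Linux, home directories are typically under `/home/<USERNAME>`: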

   ```json
   {
     "mcpServers": {
       "datadog": {
         "type": "stdio",
         "command": "/Users/<USERNAME>/.local/bin/datadog_mcp_cli",
         "args": [],
         "env": {}
       }
     }
   }
   ```

   Alternatively, add the server with the Claude Code CLI:

   ```bash
   claude mcp add datadog --scope user -- ~/.local/bin/datadog_mcp_cli
   ```
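
If you prefer not to edit the command path by hand, the following Python sketch resolves the binary's default install location against the current user's home directory and prints the corresponding `mcpServers` entry. The helper name `stdio_server_config` is hypothetical; the JSON keys mirror the Claude Code example in this tab.

```python
import json
from pathlib import Path


def stdio_server_config(binary: Path) -> dict:
    """Build the `mcpServers` entry for the local Datadog MCP binary."""
    return {
        "mcpServers": {
            "datadog": {
                "type": "stdio",
                "command": str(binary),  # absolute path, no <USERNAME> placeholder
                "args": [],
                "env": {},
            }
        }
    }


# Default install location from the install step, resolved for the current user.
binary = Path.home() / ".local" / "bin" / "datadog_mcp_cli"
if not binary.exists():
    print(f"warning: {binary} not found; run the install step first")
print(json.dumps(stdio_server_config(binary), indent=2))
```

The snippet only prints the fragment. Merging it into an existing `~/.claude.json` is left to you, so that other settings in that file are not overwritten.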

{% /tab %}

### Authenticate with API keys{% #authenticate-with-api-keys %}

The MCP Server uses OAuth 2.0 by default. If OAuth is unavailable, send a Datadog [API key and application key](https://docs.datadoghq.com/account_management/api-app-keys.md) as the `DD_API_KEY` and `DD_APPLICATION_KEY` HTTP headers:

{% callout %}
# Important note for users on the following Datadog sites: app.datadoghq.com, us3.datadoghq.com, us5.datadoghq.com, app.datadoghq.eu, ap1.datadoghq.com, ap2.datadoghq.com, uk1.datadoghq.com



```
{
  "mcpServers": {
    "datadog": {
      "type": "http",
      "url": "<YOUR_MCP_SERVER_ENDPOINT>?toolsets=llmobs,core",
      "headers": {
        "DD_API_KEY": "<YOUR_API_KEY>",
        "DD_APPLICATION_KEY": "<YOUR_APPLICATION_KEY>"
      }
    }
  }
}
```


{% /callout %}

{% callout %}
# Important note for users on the following Datadog sites: app.ddog-gov.com, us2.ddog-gov.com



{% alert level="danger" %}
This product is not supported for your selected site ({% placeholder "user-datadog-site-name" /%}).
{% /alert %}


{% /callout %}

For security, scope the API key and application key to a [service account](https://docs.datadoghq.com/account_management/org_settings/service_accounts.md) with only the required permissions.
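
To avoid pasting keys into a file by hand, you can generate the configuration from environment variables. The following Python sketch is one way to do that. The helper name `http_server_config` is hypothetical, and reusing `DD_API_KEY` and `DD_APPLICATION_KEY` as environment variable names is a choice made for this example, not a requirement of the MCP Server.

```python
import json
import os
from typing import Mapping


def http_server_config(endpoint: str, env: Mapping[str, str]) -> dict:
    """Build an HTTP `mcpServers` entry that authenticates with key headers."""
    names = ("DD_API_KEY", "DD_APPLICATION_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise ValueError(f"missing environment variables: {', '.join(missing)}")
    return {
        "mcpServers": {
            "datadog": {
                "type": "http",
                "url": endpoint,
                # Header names are the ones documented in this section.
                "headers": {name: env[name] for name in names},
            }
        }
    }


try:
    # Replace the placeholder with the endpoint shown for your site.
    endpoint = "<YOUR_MCP_SERVER_ENDPOINT>?toolsets=llmobs,core"
    print(json.dumps(http_server_config(endpoint, os.environ), indent=2))
except ValueError as err:
    print(f"cannot build config: {err}")
```

The printed configuration contains secrets, so redirect it to a file with restricted permissions rather than leaving it in terminal history or logs.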

## Agent skills{% #agent-skills %}

Agent skills are prebuilt instruction sets for AI coding agents that automate common Agent Observability workflows. The `agent-observability` skill set is available in the [Datadog agent-skills](https://github.com/datadog-labs/agent-skills) repository. It provides six skills that cover classifying sessions, diagnosing failures, analyzing experiments, generating experiment code with the `ddtrace.llmobs` SDK, bootstrapping evaluators against your live production data, and running a guided end-to-end evaluation pipeline.

### Install{% #install %}

Install the `agent-observability` skills with the following command:

```shell
npx skills add datadog-labs/agent-skills --skill agent-observability --full-depth -y
```

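The skills require the `llmobs` MCP toolset to be connected. If you have not already connected it, run the following command. The URL shown is site-specific; if your organization uses a different Datadog site, substitute the endpoint from the [Endpoint](#endpoint) section: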

```shell
claude mcp add --scope user --transport http "datadog-llmo-mcp" \
  'https://mcp.datadoghq.com/v1/mcp?toolsets=llmobs'
```

Restart Claude Code after running both commands for the skills to appear.

### Available skills{% #available-skills %}

| Skill                     | Invoke with                                    | What it does                                                                                                                                                                                                           |
| ------------------------- | ---------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Session classify          | `/agent-observability-session-classify`        | Classify whether user intent was satisfied in a session, trace, or batch                                                                                                                                               |
| Trace RCA                 | `/agent-observability-trace-rca`               | Run root cause analysis on failing production traces                                                                                                                                                                   |
| Experiment analyzer       | `/agent-observability-experiment-analyzer`     | Analyze and compare LLM experiment results                                                                                                                                                                             |
| Experiment Python codegen | `/agent-observability-experiment-py-bootstrap` | Generate Python experiment code using the `ddtrace.llmobs` SDK. Introspects your app to wire a real `task_fn`, auto-discovers `.env` credentials, and accepts a free-form `--purpose` that directs evaluator selection |
| Eval bootstrap            | `/agent-observability-eval-bootstrap`          | Generate evaluator code, publish online LLM-judge evaluators, or sample traces into a dataset for use in an experiment                                                                                                 |
| Eval pipeline             | `/agent-observability-eval-pipeline`           | Six-phase guided pipeline from production traces through evaluators, datasets, experiments, and analysis. Stop early with `--stop-after`, resume mid-flow with `--start-at`                                            |

#### Session classification{% #session-classification %}

`/agent-observability-session-classify` classifies whether user intent was satisfied in a given interaction. It draws from up to three signal sources: Agent Observability traces, RUM behavioral data, and Audit Trail events. The skill returns a `yes / partial / no` verdict with supporting evidence. Confidence improves with each additional signal source.

```
/agent-observability-session-classify session_id=<SESSION_ID>
/agent-observability-session-classify trace_id=<TRACE_ID>
/agent-observability-session-classify ml_app=my-chatbot --timeframe now-7d
```

#### Trace root cause analysis{% #trace-root-cause-analysis %}

`/agent-observability-trace-rca` diagnoses why an LLM application is producing poor results. It selects an analysis mode based on the strongest available signal (LLM-judge eval verdicts, runtime errors, or structural anomalies) and compiles a structured RCA report. The report includes a failure taxonomy and concrete `BEFORE` / `AFTER` fix proposals grounded in trace evidence.

When Claude Code has access to your codebase, the skill can search for the relevant source files and propose diffs inline.

```
/agent-observability-trace-rca ml_app=my-chatbot
/agent-observability-trace-rca ml_app=my-chatbot eval_name=faithfulness --timeframe now-24h
```

#### Evaluator bootstrap{% #evaluator-bootstrap %}

`/agent-observability-eval-bootstrap` analyzes production traces and proposes a suite of evaluators targeting the observed failure modes. It outputs one of four artifacts: Python `BaseEvaluator` / `LLMJudge` classes for offline experiments; a framework-agnostic JSON spec; online LLM-judge evaluators published directly to Datadog; or, with `--emit-dataset <path>`, a `DatasetRecordRaw[]` JSON sampled from production traces and shaped for `LLMObs.create_dataset(records=...)`. The dataset-emit mode skips the evaluator workflow entirely and produces a dataset suitable for use as the input to an experiment.

```
/agent-observability-eval-bootstrap ml_app=my-chatbot
/agent-observability-eval-bootstrap ml_app=my-chatbot --publish
/agent-observability-eval-bootstrap ml_app=my-chatbot --data-only
/agent-observability-eval-bootstrap ml_app=my-chatbot --emit-dataset ./datasets/my_chatbot_seed.json
```

#### Experiment analyzer{% #experiment-analyzer %}

`/agent-observability-experiment-analyzer` retrieves experiment results and surfaces what changed between a candidate and a baseline: which metrics improved, which regressed, and where the candidate underperformed.

```
/agent-observability-experiment-analyzer experiment_id=<EXPERIMENT_ID>
/agent-observability-experiment-analyzer experiment_id=<CANDIDATE_ID> baseline_id=<BASELINE_ID>
```

#### Generate experiment code with the Python SDK{% #generate-experiment-code-with-the-python-sdk %}

`/agent-observability-experiment-py-bootstrap` emits a self-contained `.py` script or Jupyter `.ipynb` notebook that uses the `ddtrace.llmobs` SDK and matches the canonical reference notebook style.

The dataset can be a local `DatasetRecordRaw[]` JSON (inlined into the file), a CSV (loaded at runtime with `LLMObs.create_dataset_from_csv`), an existing Datadog dataset by name (`LLMObs.pull_dataset`), or, by default, a small inline 3-record sample. Every generated experiment is tagged with `generated_by=claude-code` and the resolved `--purpose` in both `config` and `tags`.

```
/agent-observability-experiment-py-bootstrap --purpose "validate output accuracy"
/agent-observability-experiment-py-bootstrap --purpose "test tool selection" --dataset ./data/qa.json
/agent-observability-experiment-py-bootstrap --dataset-name <DATASET_NAME> --project-name <PROJECT_NAME>
/agent-observability-experiment-py-bootstrap --task-source mymodule.handlers:respond
```
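
The `--dataset ./data/qa.json` invocation above takes a local JSON array of records. As a rough sketch, the following Python snippet serializes such an array. The record keys `input_data` and `expected_output` are an assumption about the `DatasetRecordRaw` shape and are not confirmed on this page, so check the `ddtrace.llmobs` SDK reference before relying on them. The helper name `to_dataset_json` and the sample records are hypothetical.

```python
import json


def to_dataset_json(records: list[dict]) -> str:
    """Serialize records to a JSON array after a basic shape check."""
    for i, record in enumerate(records):
        # Assumed key name; verify against the SDK's DatasetRecordRaw type.
        if "input_data" not in record:
            raise ValueError(f"record {i} has no input_data")
    return json.dumps(records, indent=2)


# Hypothetical sample records for illustration only.
sample = [
    {"input_data": {"question": "What is 2 + 2?"}, "expected_output": "4"},
    {"input_data": {"question": "What is the capital of France?"}, "expected_output": "Paris"},
]
print(to_dataset_json(sample))
```

Write the returned string to a file such as `./data/qa.json` and pass that path to `--dataset`.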

#### End-to-end eval pipeline{% #end-to-end-eval-pipeline %}

`/agent-observability-eval-pipeline` walks from production traces through evaluators, datasets, experiments, and analysis in six narrated phases, with a user checkpoint between each:

1. **Classify ml\_app traces**: Sample and classify recent traces from your `ml_app`.
1. **Root cause analysis**: Diagnose why failing traces are failing.
1. **Bootstrap evaluators**: Propose an evaluator suite targeting the observed failure modes.
1. **Create and publish dataset**: Extract input / expected_output pairs into a `DatasetRecordRaw[]` JSON and publish it to Datadog under your project (the project is created on first use).
1. **Generate and run experiment**: Emit a runnable `.py` or `.ipynb` that pulls the dataset and wires your app's task function, then execute it end-to-end and capture `experiment.url`. An in-phase review step (`run` / `edit` / `stop`) sits between codegen and execution so you can inspect the generated file before it runs.
1. **Analyze experiment**: Produce an analysis report with metric breakdowns and recommendations.

Each phase has a canonical short name, which is the value accepted by `--start-at` and `--stop-after`. The table below lists, for each phase, the MCP tools the pipeline may invoke and a one-line summary of its logic:

| \# | Phase title                 | Stage name       | MCP tools called                                                                                                                                                                 | Summary                                                                                                                                 |
| -- | --------------------------- | ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| 1  | Classify ml_app traces      | `classify`       | `search_llmobs_spans`                                                                                                                                                            | Samples recent root spans for the `ml_app`, classifies each as success / partial / failure, surfaces common patterns.                   |
| 2  | Root cause analysis         | `rca`            | `search_llmobs_spans`                                                                                                                                                            | Pulls full traces for failing spans from Phase 1 and walks the trace tree to attribute each failure to a root span and a failure mode.  |
| 3  | Bootstrap evaluators        | `eval-bootstrap` | None (local reasoning over the Phase 2 report); optional Datadog API call to publish online LLM-judge evaluators when `--publish` is set                                         | Emits a Python evaluator suite (`sdk_code`), a framework-agnostic JSON spec (`data_only`), or publishes online evaluators (`publish`).  |
| 4  | Create and publish dataset  | `dataset`        | `search_llmobs_spans` for sampling; `LLMObs.create_dataset()` via the ddtrace SDK (not MCP) for publish                                                                          | Samples root spans, extracts input / expected_output pairs, scrubs PII, writes a local JSON, then publishes to Datadog.                 |
| 5  | Generate and run experiment | `experiment`     | `list_llmobs_evals` (one-shot startup beacon — connectivity + telemetry); runtime uses the ddtrace SDK                                                                           | Introspects your app for LLM call sites, emits a self-contained `.py` or `.ipynb` wiring `task_fn` to a real entry point, then runs it. |
| 6  | Analyze experiment          | `analyze`        | `get_llmobs_experiment_summary`, `get_llmobs_experiment_metric_values`, `list_llmobs_experiment_events`, `get_llmobs_experiment_event`, `get_llmobs_experiment_dimension_values` | Pulls top-line metrics, per-record scores, segment dimensions, and drill-down events; synthesizes a structured analysis report.         |

You can `stop` cleanly at any checkpoint and resume later with `--start-at <stage-name>` without re-running earlier phases. Pass `--stop-after eval-bootstrap` to preserve the classic three-phase eval-only behavior.

```
/agent-observability-eval-pipeline my-chatbot --project-name my-chatbot
/agent-observability-eval-pipeline my-chatbot --stop-after eval-bootstrap          # classic 3-phase
/agent-observability-eval-pipeline my-chatbot --start-at experiment                # resume mid-flow
/agent-observability-eval-pipeline my-chatbot --start-at analyze --experiment-id <UUID>
```
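
To make the flag semantics concrete, the following Python sketch computes which stages run for a given combination of `--start-at` and `--stop-after`. It illustrates the behavior described above and is not the skill's implementation. Treating both flags as inclusive is inferred from the "classic three-phase" example, and the function name `stages_to_run` is hypothetical.

```python
from typing import Optional

# Stage names in pipeline order, as listed in the phase table.
STAGES = ["classify", "rca", "eval-bootstrap", "dataset", "experiment", "analyze"]


def stages_to_run(start_at: Optional[str] = None, stop_after: Optional[str] = None) -> list[str]:
    """Return the stages selected by --start-at and --stop-after (both inclusive)."""
    first = STAGES.index(start_at) if start_at else 0
    last = STAGES.index(stop_after) if stop_after else len(STAGES) - 1
    if first > last:
        raise ValueError(f"--start-at {start_at} comes after --stop-after {stop_after}")
    return STAGES[first : last + 1]


print(stages_to_run(stop_after="eval-bootstrap"))  # ['classify', 'rca', 'eval-bootstrap']
print(stages_to_run(start_at="experiment"))        # ['experiment', 'analyze']
```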

For a complete guide to these skills and a recommended end-to-end workflow, see [Analyze LLM Applications with Claude Code Skills](https://docs.datadoghq.com/llm_observability/guide/claude_code_skills.md).

## Use cases{% #use-cases %}

The Agent Observability MCP tools enable AI-assisted workflows for:

- **Debugging agent execution**: Search for traces by ML app, error status, or custom tags, then examine span hierarchies and content to identify failures.
- **Analyzing trace structure**: Visualize the full span tree of a trace to understand how agents, LLMs, tools, and retrievals interact.
- **Investigating agent loops**: Review an agent's step-by-step execution loop to understand decision-making and tool invocation patterns.
- **Evaluating experiments**: Get summary statistics for experiment metrics, compare results across dimension segments, and inspect individual events.
- **Discovering experiment patterns**: Filter and sort experiment events by metric performance to find the best and worst-performing cases.
- **Managing evaluators**: List, inspect, create, update, and delete evaluator configurations across an ML application or the entire organization.
- **Exploring Patterns**: List pattern configurations, check run status, and browse the discovered topic hierarchy to understand what users are asking and how traffic is distributed.

## Available tools{% #available-tools %}

The `llmobs` toolset includes the following tools:

### Trace and span tools{% #trace-and-span-tools %}

{% dl %}

{% dt %}
`search_llmobs_spans`
{% /dt %}

{% dd %}
Search for spans matching filters or a raw query.
{% /dd %}

{% dt %}
`get_llmobs_trace`
{% /dt %}

{% dd %}
Get the full structure of a trace as a span hierarchy tree, including span counts by kind, error indicators, and total duration.
{% /dd %}

{% dt %}
`get_llmobs_span_details`
{% /dt %}

{% dd %}
Get detailed metadata for one or more spans, including timing, error info, LLM details (model, token counts), metrics, and evaluations.
{% /dd %}

{% dt %}
`get_llmobs_span_content`
{% /dt %}

{% dd %}
Retrieve the actual content of a span field (input, output, messages, documents, or metadata) with optional JSONPath extraction.
{% /dd %}

{% dt %}
`find_llmobs_error_spans`
{% /dt %}

{% dd %}
Find all error spans in a trace with propagation context, grouped by span kind with error messages and stack traces.
{% /dd %}

{% dt %}
`expand_llmobs_spans`
{% /dt %}

{% dd %}
Load children of specific spans for progressive tree exploration when `get_llmobs_trace` returns collapsed nodes.
{% /dd %}

{% dt %}
`get_llmobs_agent_loop`
{% /dt %}

{% dd %}
Get a chronological view of an agent's execution loop, showing each step (LLM calls, tool invocations, decisions) in order.
{% /dd %}

{% /dl %}

### Experiment tools{% #experiment-tools %}

{% dl %}

{% dt %}
`get_llmobs_experiment_summary`
{% /dt %}

{% dd %}
Get a high-level experiment summary with pre-computed statistics for all evaluation metrics. Start here before using other experiment tools.
{% /dd %}

{% dt %}
`list_llmobs_experiment_events`
{% /dt %}

{% dd %}
List experiment events with filtering by dimension or metric and sorting by metric value.
{% /dd %}

{% dt %}
`get_llmobs_experiment_event`
{% /dt %}

{% dd %}
Get full details for a single experiment event, including input, output, expected output, all metrics, and dimensions.
{% /dd %}

{% dt %}
`get_llmobs_experiment_metric_values`
{% /dt %}

{% dd %}
Get statistical analysis for a specific evaluation metric, optionally segmented by a dimension for comparison.
{% /dd %}

{% dt %}
`get_llmobs_experiment_dimension_values`
{% /dt %}

{% dd %}
Get unique values for a dimension with counts, useful for discovering valid filter and segment values.
{% /dd %}

{% /dl %}

### Evaluator tools{% #evaluator-tools %}

{% dl %}

{% dt %}
`list_llmobs_evals`
{% /dt %}

{% dd %}
List every LLM-judge evaluator configured across all ML applications. Returns each evaluator's name, ml_app, and enabled status.
{% /dd %}

{% dt %}
`list_llmobs_evals_by_ml_app`
{% /dt %}

{% dd %}
List all LLM-judge evaluators configured for a specific ML application.
{% /dd %}

{% dt %}
`get_llmobs_evaluator`
{% /dt %}

{% dd %}
Retrieve an LLM-judge evaluator configuration by name, including its target (ml_app, sampling, filter), LLM provider, and judge prompt template.
{% /dd %}

{% dt %}
`create_or_update_llmobs_evaluator`
{% /dt %}

{% dd %}
Create or update an LLM-judge evaluator configuration. Targets a specific ML application and optionally a filter or sampling percentage; the judge's model and prompt template define how it scores each span.
{% /dd %}

{% dt %}
`delete_llmobs_evaluator`
{% /dt %}

{% dd %}
Delete an LLM-judge evaluator configuration by name.
{% /dd %}

{% /dl %}

### Patterns tools{% #patterns-tools %}

{% dl %}

{% dt %}
`list_llmobs_pattern_configs`
{% /dt %}

{% dd %}
List all Patterns configurations for the org. Returns each config's `id`, `name`, `evp_query`, sampling settings, and timestamps. Start here to find a `config_id`.
{% /dd %}

{% dt %}
`get_llmobs_pattern_config`
{% /dt %}

{% dd %}
Get the most-recently-modified Patterns configuration for the org.
{% /dd %}

{% dt %}
`get_llmobs_pattern_run_status`
{% /dt %}

{% dd %}
Get the status and per-activity progress of the most recent Patterns run for a config. Use this to check whether clustering is running, completed, or failed before reading topics.
{% /dd %}

{% dt %}
`list_llmobs_pattern_runs`
{% /dt %}

{% dd %}
List all completed Patterns runs for a config, newest first. Returns each run's `id`, `status`, timestamps, and the `config_snapshot` used.
{% /dd %}

{% dt %}
`get_llmobs_patterns`
{% /dt %}

{% dd %}
Get the topic hierarchy discovered by a Patterns run. Topics are organized into levels, each with a `name`, `description`, and `point_count`. Omit `run_id` to read the most recent completed run.
{% /dd %}

{% dt %}
`get_llmobs_patterns_with_points`
{% /dt %}

{% dd %}
Get the topic hierarchy for a run with span IDs inlined on each leaf topic. Set `include_metrics=true` to also include per-span duration, cost, token counts, and evaluations.
{% /dd %}

{% dt %}
`get_llmobs_pattern_points`
{% /dt %}

{% dd %}
Get a cursor-paginated page of clustering points (individual spans) assigned to a single topic. Each point includes the `span_id`, `session_id`, and a span input preview. Pass `next_page_token` back as `page_token` to continue paging.
{% /dd %}

{% /dl %}
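
The cursor pagination described for `get_llmobs_pattern_points` can be sketched as a loop that feeds each `next_page_token` back as `page_token`. In this Python sketch, `call_tool` stands in for whatever function your MCP client exposes to invoke a tool. The `topic_id` argument name and the `points` response key are assumptions made for illustration; only the tool name and the two token fields come from the description above.

```python
from typing import Any, Callable, Iterator


def iter_pattern_points(
    call_tool: Callable[[str, dict], dict],
    topic_id: str,
) -> Iterator[Any]:
    """Yield every point for one topic by following next_page_token."""
    page_token = None
    while True:
        args = {"topic_id": topic_id}  # hypothetical argument name
        if page_token:
            args["page_token"] = page_token
        page = call_tool("get_llmobs_pattern_points", args)
        yield from page.get("points", [])  # "points" key is an assumption
        page_token = page.get("next_page_token")
        if not page_token:
            break


# Stand-in for a real MCP client call, used only to exercise the loop.
_PAGES = {
    None: {"points": [{"span_id": "a"}], "next_page_token": "t1"},
    "t1": {"points": [{"span_id": "b"}]},
}


def fake_call_tool(name: str, args: dict) -> dict:
    return _PAGES[args.get("page_token")]


print([p["span_id"] for p in iter_pattern_points(fake_call_tool, "topic-1")])  # ['a', 'b']
```

In practice an AI client performs this paging for you. The sketch is mainly useful for understanding when the agent has reached the last page.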

## Recommended workflows{% #recommended-workflows %}

### Trace analysis{% #trace-analysis %}

1. **Search**: Use `search_llmobs_spans` to find traces by ML app, status, span kind, or custom tags.
1. **Visualize**: Use `get_llmobs_trace` to see the full span hierarchy tree.
1. **Inspect**: Use `get_llmobs_span_details` to get metadata, timing, and evaluations for specific spans.
1. **Read content**: Use `get_llmobs_span_content` to retrieve the actual I/O, messages, or documents.
1. **Debug errors**: Use `find_llmobs_error_spans` to locate all errors in a trace with propagation context.
1. **Expand**: Use `expand_llmobs_spans` to load children of collapsed spans for deeper exploration.
1. **Agent review**: Use `get_llmobs_agent_loop` to see the step-by-step execution flow of an agent span.

### Experiment analysis{% #experiment-analysis %}

1. **Summarize**: Use `get_llmobs_experiment_summary` to get overall statistics and discover available metrics and dimensions.
1. **Browse events**: Use `list_llmobs_experiment_events` to find events of interest, filtering by dimension or sorting by metric.
1. **Inspect events**: Use `get_llmobs_experiment_event` to view full details for a specific event.
1. **Analyze metrics**: Use `get_llmobs_experiment_metric_values` to get percentile distributions, true/false rates, or compare across dimension segments.
1. **Discover dimensions**: Use `get_llmobs_experiment_dimension_values` to find valid filter and segment values.

### Patterns analysis{% #patterns-analysis %}

1. **List configs**: Use `list_llmobs_pattern_configs` to find available Patterns configurations and their `config_id` values.
1. **Check run status**: Use `get_llmobs_pattern_run_status` to verify the most recent run is complete.
1. **Read topics**: Use `get_llmobs_patterns` to get the full topic hierarchy with names, descriptions, and coherence scores.
1. **Inspect spans**: Use `get_llmobs_patterns_with_points` to get topics with span IDs inlined, or `get_llmobs_pattern_points` to page through the spans of a specific topic.
1. **Analyze span content**: Use `get_llmobs_span_details` or `get_llmobs_span_content` with the `span_id` values from the previous step to inspect the actual inputs, outputs, and metadata of individual spans within a topic.
1. **Browse past runs**: Use `list_llmobs_pattern_runs` to see historical runs and pass a specific `run_id` to compare topic distributions over time.

## Example prompts{% #example-prompts %}

After connecting, try prompts like:

- Review error traces for my `customer-support-bot` app over the past week. Summarize the most common failure patterns, how often they occur, and recommend which ones to fix first.
- Find traces where my agent's responses were flagged by evaluations as low quality. Look at the inputs and outputs, then suggest specific changes to my system prompt to improve response quality.
- Look at recent agent traces for my app and find cases where the agent looped more than necessary. Analyze the decision-making at each step and suggest how to improve my tool descriptions to reduce unnecessary tool calls.
- A user reported a bad response. Here's the trace ID: `trace-123`. Walk me through exactly what happened: what the user asked, what the agent did at each step, and where things went wrong. Suggest a code fix.
- Analyze experiment `exp-456` and generate a markdown table of the worst-performing dimensions broken down by evaluation scores. Include any other relevant columns that help me understand where and why performance is degrading.
- Compare experiment `exp-123` (baseline) against experiment `exp-456`. Summarize what improved, what regressed, and by how much. Give me a recommendation on whether the changes are worth shipping.
- Summarize experiment `exp-456` and identify the top 5 lowest-scoring events. For each, show the input, output, and which evaluations failed.

## Combine with other Datadog tools{% #combine-with-other-datadog-tools %}

The `core` toolset included in the setup URL gives your AI agent access to additional Datadog tools that pair naturally with Agent Observability analysis.

### Export analysis to Datadog Notebooks{% #export-analysis-to-datadog-notebooks %}

The `core` toolset includes `create_datadog_notebook` and `edit_datadog_notebook`, which let your AI agent create [Datadog Notebooks](https://docs.datadoghq.com/notebooks.md) directly from analysis results. You can export findings from agent chats into a collaborative, shareable notebook that lives in Datadog alongside your traces and experiments.

Try prompts like:

- Analyze experiment `exp-456`, identify the worst-performing dimensions, and export a summary report to a Datadog Notebook with a breakdown by evaluation scores.
- Review error traces for my `customer-support-bot` over the past week and create a Datadog Notebook with the findings, including common failure patterns and recommended fixes.

For custom visualizations that go beyond standard Datadog widgets, like comparison charts or quadrant plots, Notebooks also render [Mermaid diagrams](https://docs.datadoghq.com/notebooks/guide/build_diagrams_with_mermaidjs.md) natively. Try prompts like:

- Analyze experiment `exp-456`, compare the `accuracy` scores across each prompt version, and export the results to a Datadog Notebook that includes a Mermaid bar chart of the average score for each version.
- Analyze experiment `exp-456` and export a Datadog Notebook that plots each prompt version on a Mermaid quadrant chart with `relevance` on one axis and `accuracy` on the other. Identify which versions are underperforming on both dimensions.

## Further reading{% #further-reading %}

- [Datadog MCP Server](https://docs.datadoghq.com/mcp_server.md)
- [Set up and use Agent Observability Experiments](https://docs.datadoghq.com/llm_observability/experiments.md)
- [Monitor your application with Agent Observability](https://docs.datadoghq.com/llm_observability/monitoring.md)
- [Analyze LLM Applications with Claude Code Skills](https://docs.datadoghq.com/llm_observability/guide/claude_code_skills.md)
