---
title: Analyze LLM Applications with Claude Code Skills
description: >-
  Use Datadog's Claude Code skills to classify sessions, diagnose failures,
  compare experiments, and bootstrap evaluators against your live production
  data.
---

# Analyze LLM Applications with Claude Code Skills

{% callout %}
# Important note for users on the following Datadog sites: app.ddog-gov.com, us2.ddog-gov.com

{% alert level="danger" %}
This product is not supported for your selected [Datadog site](https://docs.datadoghq.com/getting_started/site.md).
{% /alert %}

{% /callout %}

## Overview{% #overview %}

Datadog provides a set of [Claude Code](https://claude.ai/code) skills that bring LLM Observability analysis directly into your development workflow. Rather than navigating dashboards manually, you can invoke these skills from a Claude Code session to classify sessions, diagnose failures, compare experiments, and generate evaluators — all against your live production data.

| Skill                          | What it does                                                                                        |
| ------------------------------ | --------------------------------------------------------------------------------------------------- |
| `/llm-obs-session-classify`    | Classify whether user intent was satisfied in a session, trace, or batch of sessions from an ml_app |
| `/llm-obs-trace-rca`           | Root cause analysis on failing production LLM traces                                                |
| `/llm-obs-experiment-analyzer` | Analyze and compare LLM experiment results                                                          |
| `/llm-obs-eval-bootstrap`      | Generate evaluator code or publish online LLM-judge evaluators from trace data                      |
| `/llm-obs-eval-pipeline`       | End-to-end pipeline: classify sessions → root cause analysis → bootstrap evaluators                 |

The skills produce structured, actionable output — RCA reports with before/after fix proposals, generated evaluator code, experiment comparisons — that you can pass directly to a coding agent to apply fixes to your application. When Claude Code has access to your codebase, it can search for the relevant system prompt, tool definitions, or routing logic and propose specific diffs without leaving the session.

## Setup{% #setup %}

### Prerequisites{% #prerequisites %}

- [Claude Code](https://claude.ai/code) installed and authenticated
- At least one LLM application [instrumented with LLM Observability](https://docs.datadoghq.com/llm_observability/setup.md) and producing traces
- A data backend: either the Datadog MCP server **or** the `pup` CLI

### Option A: Datadog MCP server{% #datadog-mcp-server %}

To use the Datadog MCP server option, connect the LLM Observability MCP server to your Claude Code session:

```shell
claude mcp add --scope user --transport http datadog-llmo-mcp \
  'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'
```

All skills detect the MCP server automatically at startup and use it throughout.

### Option B: pup CLI{% #option-b-pup-cli %}

If you prefer not to use the MCP server, the skills also run over [`pup`](https://datadoghq.atlassian.net/wiki/spaces/BITSAI/pages/5226692942/pup+CLI), Datadog's internal CLI. Install `pup` and authenticate:

```shell
pup auth login
```

Each skill detects at startup whether the MCP server is available; if not, it checks for `pup` and switches to pup mode automatically. You can also force pup mode explicitly by passing `--backend pup` to any skill invocation.

In pup mode, all Datadog API calls are made through `pup llm-obs` subcommands instead of MCP tools. The output and workflow are identical.
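
The fallback order described above can be sketched in a few lines of Python. This is an illustration only, not the skills' actual implementation: `choose_backend` and its parameters are hypothetical names, and the only real call is the standard library's `shutil.which`, used here to check whether a `pup` binary is on `PATH`.

```python
import shutil
from typing import Optional


def choose_backend(
    mcp_available: bool,
    forced: Optional[str] = None,
    pup_on_path: Optional[bool] = None,
) -> str:
    """Return "mcp" or "pup" following the order described in this section.

    All names here are hypothetical; the real skills may decide differently.
    """
    # An explicit `--backend pup` overrides auto-detection.
    if forced == "pup":
        return "pup"
    # Otherwise the MCP server is preferred whenever the session can reach it.
    if mcp_available:
        return "mcp"
    # Fall back to pup only if the binary can be found on PATH.
    if pup_on_path is None:
        pup_on_path = shutil.which("pup") is not None
    if pup_on_path:
        return "pup"
    raise RuntimeError("No data backend: connect the MCP server or install pup")
```

The `pup_on_path` argument exists only so the decision can be exercised without touching the local machine; leave it unset to probe `PATH`.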

## Skills{% #skills %}

### Classify sessions and traces{% #classify-sessions-and-traces %}

`/llm-obs-session-classify` assesses whether a user's intent was satisfied in a given interaction. It operates in three modes depending on what you provide:

| Mode    | Invoke with  | Use when                                                      |
| ------- | ------------ | ------------------------------------------------------------- |
| Session | `session_id` | Evaluating a specific session                                 |
| Trace   | `trace_id`   | Evaluating a single LLM Observability trace                   |
| App     | `ml_app`     | Sampling and classifying a batch of recent sessions or traces |

The skill draws on up to three signal sources, and its accuracy improves as more of them are available:

- **LLM Observability traces** — the full span tree, conversation content, tool call results, and eval judge verdicts. Always available.
- **RUM behavioral signals** — page views, custom actions, dwell time, and explicit feedback events that confirm or contradict what the trace shows. Available when RUM is instrumented for your app.
- **Audit Trail** — server-confirmed write events (dashboards created, monitors modified, notebooks deleted) that prove whether the assistant's actions actually landed. Most authoritative signal when the session involved asset creation or editing.

The skill returns a compact `yes / partial / no` verdict with a one-sentence reason by default. Add `verbose: true` for a full markdown report.

**Examples:**

```
/llm-obs-session-classify session_id=abc-123
/llm-obs-session-classify trace_id=def-456
/llm-obs-session-classify ml_app=my-chatbot --timeframe now-7d
```
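
If you collect the verdicts from an app-mode run, a small helper can turn them into per-label shares. The sketch below assumes a record shape (`session_id`, `label`, `reason`) that mirrors the `yes / partial / no` verdict and one-sentence reason described above. The skill's actual output format is not specified in this guide, so treat `Verdict` and `summarize` as hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Literal

Label = Literal["yes", "partial", "no"]


@dataclass(frozen=True)
class Verdict:
    """Hypothetical record for one classified session."""
    session_id: str
    label: Label
    reason: str  # the one-sentence explanation


def summarize(verdicts: List[Verdict]) -> Dict[str, float]:
    """Return the share of sessions per label, always listing all three labels."""
    counts = Counter(v.label for v in verdicts)
    total = len(verdicts)
    return {
        label: (counts[label] / total if total else 0.0)
        for label in ("yes", "partial", "no")
    }


sample = [
    Verdict("s1", "yes", "User got the dashboard they asked for."),
    Verdict("s2", "yes", "Question answered on the first turn."),
    Verdict("s3", "partial", "Monitor created but with the wrong threshold."),
    Verdict("s4", "no", "Assistant looped on a failing tool call."),
]
shares = summarize(sample)
```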

### Diagnose failures with root cause analysis{% #diagnose-failures-with-root-cause-analysis %}

`/llm-obs-trace-rca` walks the span tree of failing traces to identify why your LLM application is producing poor results. It selects the best analysis mode based on available signals: LLM-judge eval verdicts (strongest signal), runtime errors, or structural anomalies such as latency outliers and agent-loop decisions.

The skill samples failing spans, groups them into a failure taxonomy, and compiles a structured RCA report with root cause categories, supporting evidence, and concrete fix proposals. Each fix quotes the actual text or code from the trace (system prompt excerpts, tool argument shapes, routing logic) in a `BEFORE` / `AFTER` pair that shows exactly what to change.

When Claude Code has access to your codebase, the skill searches for the relevant source files and proposes diffs you can apply immediately. For system prompt deficiencies, tool misuse, or routing errors, this means going from diagnosis to a pull request without leaving the session.

**Examples:**

```
/llm-obs-trace-rca ml_app=my-chatbot
/llm-obs-trace-rca ml_app=my-chatbot eval_name=faithfulness --timeframe now-24h
```
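
The two ideas in this section, choosing an analysis mode from the available signals and grouping failures into a taxonomy, can be illustrated as follows. This is a sketch under assumptions: the function names and the `category` / `trace_id` record fields are hypothetical. The guide states only that eval verdicts are the strongest signal, so the order of the remaining two modes is assumed from the order in which they are listed.

```python
from collections import defaultdict
from typing import Dict, List


def pick_analysis_mode(has_eval_verdicts: bool, has_runtime_errors: bool) -> str:
    """Prefer eval verdicts, then runtime errors, then structural anomalies."""
    if has_eval_verdicts:
        return "eval_verdicts"
    if has_runtime_errors:
        return "runtime_errors"  # assumed to rank above structural anomalies
    return "structural_anomalies"


def group_by_category(failures: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group failing trace IDs by root cause category, largest group first."""
    taxonomy: Dict[str, List[str]] = defaultdict(list)
    for failure in failures:
        taxonomy[failure["category"]].append(failure["trace_id"])
    ordered = sorted(taxonomy.items(), key=lambda item: len(item[1]), reverse=True)
    return dict(ordered)


failures = [
    {"trace_id": "t1", "category": "tool_misuse"},
    {"trace_id": "t2", "category": "prompt_gap"},
    {"trace_id": "t3", "category": "tool_misuse"},
]
taxonomy = group_by_category(failures)
```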

### Analyze and compare experiments{% #analyze-and-compare-experiments %}

`/llm-obs-experiment-analyzer` retrieves experiment results and surfaces what changed between a candidate and a baseline. It works for a single experiment (exploratory analysis) or a pair (comparative analysis).

The skill highlights which metrics improved or regressed, which event categories shifted, and where the candidate underperformed — so you can make a confident promotion decision.

**Examples:**

```
/llm-obs-experiment-analyzer experiment_id=exp-123
/llm-obs-experiment-analyzer experiment_id=exp-456 baseline_id=exp-123
```
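
To make "improved or regressed" concrete, the comparison can be modeled as a per-metric delta between the two runs. The sketch below is illustrative only: `compare_metrics` is a hypothetical helper, it assumes every metric is higher-is-better, and it ignores metrics that appear in only one of the two experiments.

```python
from typing import Dict, Tuple


def compare_metrics(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> Dict[str, Tuple[float, str]]:
    """Return {metric: (delta, status)} for metrics present in both runs.

    Assumes higher is better; invert the sign for metrics such as latency.
    """
    report = {}
    for name in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[name] - baseline[name]
        if delta > tolerance:
            status = "improved"
        elif delta < -tolerance:
            status = "regressed"
        else:
            status = "unchanged"
        report[name] = (delta, status)
    return report


report = compare_metrics(
    baseline={"faithfulness": 0.5, "relevance": 0.75, "toxicity_free": 1.0},
    candidate={"faithfulness": 0.75, "relevance": 0.5, "toxicity_free": 1.0},
)
```

A non-zero `tolerance` keeps small fluctuations from being reported as real changes.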

### Bootstrap evaluators from trace data{% #bootstrap-evaluators-from-trace-data %}

`/llm-obs-eval-bootstrap` analyzes production traces from an ml_app (or an RCA report already in context) and proposes a suite of evaluators that would catch the observed failure modes. It outputs one of three artifacts:

| Mode               | Flag          | Output                                                                                                                                          |
| ------------------ | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| SDK code (default) | —             | Python `BaseEvaluator` / `LLMJudge` classes ready to drop into an [LLM Experiment](https://docs.datadoghq.com/llm_observability/experiments.md) |
| JSON spec          | `--data-only` | Framework-agnostic evaluator spec, suitable for review or manual implementation                                                                 |
| Online judges      | `--publish`   | LLM-judge evaluators published directly to Datadog and enabled on your ml_app                                                                   |

The generated Python file is self-contained and ready to hand to a coding agent for integration into your experiment harness or CI pipeline.
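
For a sense of what a rule-based evaluator for one failure mode might look like, here is a minimal stand-in. It is not output from the skill and does not use the `BaseEvaluator` or `LLMJudge` classes, whose interfaces this guide does not show. The `(input_data, output_data, expected_output)` signature, the function name, and the refusal markers are all assumptions chosen for illustration.

```python
from typing import Any, Optional

# Hypothetical phrases that indicate the assistant declined to answer.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")


def answered_without_refusal(
    input_data: Any,
    output_data: Any,
    expected_output: Optional[Any] = None,
) -> bool:
    """Return True when the output is non-empty and contains no refusal marker."""
    text = str(output_data or "").strip().lower()
    if not text:
        return False  # an empty response never satisfies the user
    return not any(marker in text for marker in REFUSAL_MARKERS)
```

A check like this is cheap and deterministic, which makes it a reasonable complement to LLM-judge evaluators that cover subjective qualities.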

**Examples:**

```
/llm-obs-eval-bootstrap ml_app=my-chatbot
/llm-obs-eval-bootstrap ml_app=my-chatbot --publish
/llm-obs-eval-bootstrap ml_app=my-chatbot --data-only
```

### Run the end-to-end pipeline{% #run-the-end-to-end-pipeline %}

`/llm-obs-eval-pipeline` chains three of the skills above (`/llm-obs-session-classify`, `/llm-obs-trace-rca`, and `/llm-obs-eval-bootstrap`) into a single supervised workflow. It is the recommended starting point when you have no existing evaluators and want to go from raw production traces to a working evaluator suite.

```
Phase 1: llm-obs-session-classify  →  classify a sample of recent traces
Phase 2: llm-obs-trace-rca         →  root cause the failures
Phase 3: llm-obs-eval-bootstrap    →  generate evaluators for each root cause
```

The pipeline pauses for your review and approval between phases. You can exclude specific traces, adjust the failure taxonomy, or redirect the evaluator proposal before moving on.
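
The approval gates can be pictured as a loop that runs one phase, shows you the result, and continues only if you approve. The sketch below is a conceptual model, not the pipeline's implementation: `run_supervised`, the phase callables, and the `approve` callback are hypothetical.

```python
from typing import Any, Callable, List, Tuple

Phase = Tuple[str, Callable[[Any], Any]]


def run_supervised(
    phases: List[Phase],
    initial: Any,
    approve: Callable[[str, Any], bool],
) -> Tuple[List[str], Any]:
    """Run phases in order, asking for approval before each subsequent phase."""
    result = initial
    completed: List[str] = []
    for index, (name, step) in enumerate(phases):
        result = step(result)
        completed.append(name)
        is_last = index == len(phases) - 1
        # No gate after the final phase: there is nothing left to approve.
        if not is_last and not approve(name, result):
            break
    return completed, result


# Stub phases that only record that they ran.
phases: List[Phase] = [
    ("classify", lambda state: state + ["classified"]),
    ("rca", lambda state: state + ["diagnosed"]),
    ("bootstrap", lambda state: state + ["evaluators"]),
]
done, final_state = run_supervised(phases, [], lambda name, result: name != "rca")
```

In this run the reviewer declines after the RCA phase, so the bootstrap phase never starts.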

**Examples:**

```
/llm-obs-eval-pipeline my-chatbot
/llm-obs-eval-pipeline my-chatbot --timeframe now-30d --publish
```

| Option          | Default  | Description                                       |
| --------------- | -------- | ------------------------------------------------- |
| `--timeframe`   | `now-7d` | Lookback window for trace sampling                |
| `--trace-limit` | `20`     | Max traces to classify in Phase 1                 |
| `--data-only`   | off      | Emit a JSON evaluator spec instead of Python code |
| `--publish`     | off      | Publish online LLM-judge evaluators to Datadog    |
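
The options table can also be read as a small command-line schema. The `argparse` model below mirrors the documented flags and defaults so you can see how an invocation resolves. It is a sketch only: the skill is invoked inside Claude Code rather than through this parser, and treating the application name as a positional `ml_app` argument is an assumption based on the examples above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Model the documented pipeline options and their defaults."""
    parser = argparse.ArgumentParser(prog="llm-obs-eval-pipeline")
    parser.add_argument("ml_app", help="Application to analyze")
    parser.add_argument("--timeframe", default="now-7d",
                        help="Lookback window for trace sampling")
    parser.add_argument("--trace-limit", type=int, default=20,
                        help="Max traces to classify in Phase 1")
    parser.add_argument("--data-only", action="store_true",
                        help="Emit a JSON evaluator spec instead of Python code")
    parser.add_argument("--publish", action="store_true",
                        help="Publish online LLM-judge evaluators to Datadog")
    return parser


defaults = build_parser().parse_args(["my-chatbot"])
custom = build_parser().parse_args(
    ["my-chatbot", "--timeframe", "now-30d", "--publish"]
)
```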

## Typical workflow{% #typical-workflow %}

If you are new to evaluating an LLM application, the recommended flow is:

1. **Run the pipeline** to get an initial read on where your app is failing and generate a first evaluator suite:

   ```
   /llm-obs-eval-pipeline <ml_app>
   ```

1. **Apply fixes.** The RCA report produced in Phase 2 includes specific before/after fix proposals grounded in trace evidence. Pass the report to a coding agent (or act on it directly) to fix system prompts, tool definitions, or routing logic in your codebase.

1. **Run an offline experiment** using the generated evaluators against a labeled dataset to validate their quality before enabling them in production. See the [Evaluation Developer Guide](https://docs.datadoghq.com/llm_observability/guide/evaluation_developer_guide.md).

1. **Publish online evaluators** once the evaluators are validated. Running `/llm-obs-eval-bootstrap` with `--publish` creates online LLM-judge evaluators in Datadog that run automatically on your production traces in real time — no code changes required:

   ```
   /llm-obs-eval-bootstrap <ml_app> --publish
   ```

1. **Monitor and iterate.** As your app evolves, re-run `/llm-obs-trace-rca` and `/llm-obs-eval-bootstrap` to catch new failure modes and keep your evaluator suite current.
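
The validation in step 3 comes down to comparing evaluator output with human labels on the same examples. A minimal sketch of that comparison follows, assuming both are boolean lists where `True` means "this example is a failure". `agreement_report` is a hypothetical helper, not part of the Datadog SDK.

```python
from typing import Dict, List


def agreement_report(predicted: List[bool], labeled: List[bool]) -> Dict[str, float]:
    """Compute precision, recall, and accuracy of an evaluator against labels."""
    if len(predicted) != len(labeled):
        raise ValueError("predicted and labeled must have the same length")
    pairs = list(zip(predicted, labeled))
    tp = sum(1 for p, l in pairs if p and l)          # flagged, truly failing
    fp = sum(1 for p, l in pairs if p and not l)      # flagged, actually fine
    fn = sum(1 for p, l in pairs if not p and l)      # missed failure
    tn = sum(1 for p, l in pairs if not p and not l)  # correctly passed
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }


metrics = agreement_report(
    predicted=[True, True, False, False],
    labeled=[True, False, True, False],
)
```

Low precision suggests the evaluator will be noisy in production; low recall suggests it will miss the failure mode it was generated to catch.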

## Further reading{% #further-reading %}

- [LLM Observability Evaluations](https://docs.datadoghq.com/llm_observability/evaluations.md)
- [LLM Experiments](https://docs.datadoghq.com/llm_observability/experiments.md)
- [Evaluation Developer Guide: Build custom evaluators](https://docs.datadoghq.com/llm_observability/guide/evaluation_developer_guide.md)