---
title: Analyze Your Experiment Results
description: How to analyze LLM Observability Experiment results.
breadcrumbs: Docs > LLM Observability > Experiments > Analyze Your Experiment Results
---

# Analyze Your Experiment Results

{% callout %}
# Important note for users on the following Datadog site: app.ddog-gov.com

{% alert level="danger" %}
This product is not supported for your selected [Datadog site](https://docs.datadoghq.com/getting_started/site.md).
{% /alert %}

{% /callout %}

This page describes how to analyze LLM Observability Experiments results in Datadog's Experiments UI and widgets.

After running an Experiment, you can analyze the results to understand performance patterns and investigate problematic records.

## Using the Experiment page in Datadog{% #using-the-experiment-page-in-datadog %}

On the Experiments page, click on an experiment to see its details.

The **Summary** section contains the evaluations, summary evaluations, and metrics that were logged during the execution of your Experiment.

Each value is aggregated based on its type:

- **Boolean**: Aggregated as the ratio of `True` values over all recorded values.
- **Score**: Aggregated as the average of all recorded values.
- **Categorical**: Aggregated as the mode (the most frequent value in the distribution).
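
As an illustrative sketch (a model of the documented behavior, not Datadog's internal code), the three aggregation rules can be expressed in Python:

```python
from collections import Counter
from statistics import mean

def aggregate(values):
    """Illustrative sketch of the three aggregation rules above.

    Assumes every value in the list shares one type; this is not
    Datadog's actual implementation, only a model of the documented
    behavior.
    """
    if all(isinstance(v, bool) for v in values):
        # Boolean: ratio of True over all recorded values
        return sum(values) / len(values)
    if all(isinstance(v, (int, float)) for v in values):
        # Score: average over all recorded values
        return mean(values)
    # Categorical: mode (most frequent value in the distribution)
    return Counter(values).most_common(1)[0][0]
```

For example, four Boolean evaluations of `True, True, False, False` aggregate to `0.5`.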

The **Records** section contains traces related to the execution of your task on the dataset inputs. Each trace contains the list of spans showing the flow of information through your agent.

You can use the facets (on the left-hand side) to filter the records based on their evaluation results to uncover patterns.

### Customizing the results table{% #customizing-the-results-table %}

You can customize the experiment results table to surface the fields that matter most to you without opening each trace individually.

#### Column picker{% #column-picker %}

Use the column picker to toggle columns on or off and drag to reorder them. By default, raw JSON blob columns (Input, Output, Expected Output) are hidden to keep the table scannable. You can toggle them back on at any time.

The table includes a **Record ID** column that shows which dataset record each experiment run was executed against. For experiments with multiple runs per record, expand a record to see all runs underneath.

#### Custom columns{% #custom-columns %}

Extract specific fields from your top-level experiment span, such as an input key, output key, or metadata key, and display them as dedicated table columns. This lets you compare key properties across records at a glance.

To add a custom column, type a field path in the **Add Column** input at the top of the table:

| Source          | Path format                   | Example                        |
| --------------- | ----------------------------- | ------------------------------ |
| Input           | `@meta.input.<key>`           | `@meta.input.user_query`       |
| Output          | `@meta.output.<key>`          | `@meta.output.result.status`   |
| Expected output | `@meta.expected_output.<key>` | `@meta.expected_output.answer` |
| Metadata        | `@meta.metadata.<key>`        | `@meta.metadata.scenario_type` |
| Tag             | `<tag_key>`                   | `env`                          |

You can add multiple custom columns and reorder them with drag-and-drop. Column configuration is saved to your browser's local storage per project.

#### Quick actions from the span detail{% #quick-actions-from-the-span-detail %}

When viewing the root span in the span detail side panel, you can act on fields directly from the context menu instead of manually typing paths.

The following options are available on JSON fields (Input, Output, Expected Output, Metadata):

- **Copy key path**: Copies the field's full path (for example, `@meta.input.user_query`) so you can paste it into the custom column input, search bar, or a dashboard widget query.
- **Add column**: Adds the field as a custom column in the results table in one click.
- **Filter by / Exclude**: Adds the field's key-value pair to the search query to narrow down or exclude matching records. Available on leaf values (strings, numbers, booleans) only.

The following options are available on tags:

- **Copy key**: Copies the tag key (for example, `env`).
- **Copy to clipboard**: Copies the full tag including its value (for example, `env:prod`).
- **Add column**: Adds the tag key as a custom column in the results table.
- **Filter by / Exclude**: Adds the tag's key-value pair to the search query.

{% alert level="info" %}
These actions are available on the root span of a trace.
{% /alert %}

### Searching for specific records{% #searching-for-specific-records %}

You can use the search bar to find specific records based on their properties (dataset record data) or on the results of the experiment (output and evaluations). The search is executed at the trace level.

{% image
   source="https://docs.dd-static.net/images/llm_observability/experiments/exp_details_search.40ad3c4c8fc165d1642e689e26b13d12.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/llm_observability/experiments/exp_details_search.40ad3c4c8fc165d1642e689e26b13d12.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="LLM Observability, Experiment Details focus. Heading: 'Highlighting the search bar'." /%}

{% alert level="info" %}
To access the most data, update to `ddtrace-py >= 4.1.0`, as this version brings the following changes:
- Experiment spans contain metadata from the dataset record.
- Experiment spans' `input`, `output`, and `expected_output` fields are stored as-is (that is, as queryable JSON if they are emitted as such).
- Experiment spans and their child spans are tagged with `dataset_name`, `project_name`, `project_id`, and `experiment_name` for easier search.

{% /alert %}

#### Find traces by keyword{% #find-traces-by-keyword %}

Searching by keyword executes a search across all available information (input, output, expected output, metadata, tags).

#### Find traces by evaluation{% #find-traces-by-evaluation %}

To find a trace by evaluation, search: `@evaluation.<name>.value:<criteria>`

| Evaluation type | Example search term                                 |
| --------------- | --------------------------------------------------- |
| Boolean         | `@evaluation.has_risk_pred.value:true`              |
| Score           | `@evaluation.correctness.value:>=0.35`              |
| Categorical     | `@evaluation.violation.value:(not_fun OR not_nice)` |

#### Find traces by experiment status{% #find-traces-by-experiment-status %}

To find a trace by experiment status, search: `@experiment.status:<status>`

| Status    | Example search term            |
| --------- | ------------------------------ |
| Running   | `@experiment.status:running`   |
| Completed | `@experiment.status:completed` |

#### Find traces by metric{% #find-traces-by-metric %}

LLM Experiments automatically collects duration, token count, and cost metrics.

| Metric                                                              | Example search term                     |
| ------------------------------------------------------------------- | --------------------------------------- |
| Duration                                                            | `@duration:>=9.5s`                      |
| Token count                                                         | `@trace.total_tokens:>10000`            |
| Estimated total cost (in nanodollars; 1 nanodollar = 10⁻⁹ dollars) | `@trace.estimated_total_cost:>10000000` |
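
To put the cost unit in perspective, the example threshold of `10000000` nanodollars corresponds to one cent. A quick conversion sketch:

```python
NANODOLLARS_PER_DOLLAR = 10**9

def nanodollars_to_dollars(nano):
    """Convert a cost expressed in nanodollars
    (1 nanodollar = 10^-9 dollars) to dollars."""
    return nano / NANODOLLARS_PER_DOLLAR

# 10,000,000 nanodollars is $0.01, so the example query
# matches traces that cost more than one cent.
threshold_in_dollars = nanodollars_to_dollars(10_000_000)
```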

#### Find traces by tag{% #find-traces-by-tag %}

To find traces using tags, search `<tag>:<value>`. For example, `dataset_record_id:84dfd2af88c6441a856031fc2e43cb65`.

To see which tags are available, open a trace to find its tags.

{% image
   source="https://docs.dd-static.net/images/llm_observability/experiments/side-panel-tag.016dd1ebe7a76d2292a696d2c0972cf5.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/llm_observability/experiments/side-panel-tag.016dd1ebe7a76d2292a696d2c0972cf5.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="LLM Observability, Experiment trace side-panel. Highlighting where to find trace tags." /%}

#### Find traces by input, output, expected output, or metadata{% #find-traces-by-input-output-expected-output-or-metadata %}

To query a specific key in input, output, expected output, or metadata, you need to emit the property as JSON.

| Property        | Format                                           | Example search term                      |
| --------------- | ------------------------------------------------ | ---------------------------------------- |
| input           | `@meta.input.<key1>.<subkey1>:<value>`           | `@meta.input.origin.country:"France"`    |
| output          | `@meta.output.<key1>.<subkey1>:<value>`          | `@meta.output.result.status:"success"`   |
| expected output | `@meta.expected_output.<key1>.<subkey1>:<value>` | `@meta.expected_output.answer:"correct"` |
| metadata        | `@meta.metadata.<key1>.<subkey1>:<value>`        | `@meta.metadata.source:generated`        |

##### Querying JSON arrays{% #querying-json-arrays %}

Simple arrays (arrays of scalar values) are flattened as strings, and you can query them.

**Example 1**: Queryable JSON array

```json
"output": {
  "action_matches": [
    "^/cases/settings$",
    "^Create Case Type$",
    "^/cases$"
  ]
}
```

You can query this example array by searching: `@meta.output.action_matches:"^/cases/settings$"`.

**Example 2**: Non-queryable JSON array

```json
"output": {
  "expected_actions": [
    [
      "bonjour",
      "a_bientot"
    ],
    [
      "todobem",
      "click here"
    ]
  ]
}
```

In this example, the array is nested and cannot be queried.
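
The distinction can be modeled with a small sketch (illustrative only, not how Datadog actually indexes spans): an array of scalar values flattens into queryable key-value pairs, while an array of arrays does not.

```python
def flatten_for_search(key, value):
    """Illustrative model of the documented behavior: arrays of scalar
    values become queryable string values; nested arrays do not.
    This is not Datadog's indexing code."""
    if isinstance(value, list):
        if all(not isinstance(item, (list, dict)) for item in value):
            # Simple array: each element is flattened as a string
            return [(key, str(item)) for item in value]
        # Nested array: not queryable
        return []
    return [(key, str(value))]

# Example 1: flattens into three queryable values
queryable = flatten_for_search(
    "@meta.output.action_matches",
    ["^/cases/settings$", "^Create Case Type$", "^/cases$"],
)

# Example 2: nested, so nothing queryable is produced
nested = flatten_for_search(
    "@meta.output.expected_actions",
    [["bonjour", "a_bientot"], ["todobem", "click here"]],
)
```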

## Using widgets with LLM Experiments data{% #using-widgets-with-llm-experiments-data %}

You can build widgets in Dashboards and Notebooks using LLM Experiments data. Datadog suggests that you:

- Populate the metadata of your dataset records with any extra information that might help you slice your data (for example, difficulty or language).
- Ensure that your task outputs a JSON object.
- Update `ddtrace-py` to version 4.1.0 or later.
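
For example, a task that returns a JSON-serializable dictionary (rather than a plain string) keeps each key individually queryable. The function and key names below are hypothetical:

```python
def triage_task(input_data, config=None):
    """Hypothetical experiment task. Returning a dictionary rather than
    a plain string lets each key be queried and plotted, for example
    with @meta.output.result.status:"success"."""
    # ... call your model or agent here ...
    return {
        "result": {"status": "success"},
        "answer": "correct",
        "tool_calls": ["search_docs"],
    }
```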

To build a widget using LLM Experiments data, use `LLM Observability > Experiments` as the data source. Then, use the search syntax described on this page to narrow down the events to plot.

For record-level data aggregation, use `Traces`; otherwise, use `All Spans`.

Group or filter by `@experiment.status` to compare metrics across running or completed experiments.

### Widget examples{% #widget-examples %}

#### Plotting performance over time broken down by a metadata field{% #plotting-performance-over-time-broken-down-by-a-metadata-field %}

{% image
   source="https://docs.dd-static.net/images/llm_observability/experiments/widget-metadata.08c2952caeee2cfd4a391c3c902ea9f3.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/llm_observability/experiments/widget-metadata.08c2952caeee2cfd4a391c3c902ea9f3.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="Widget using LLM Experiments data. Graph showing the performance over time broken down by a metadata field." /%}

{% alert level="info" %}
To compute the average of a Boolean evaluation, you must manually compute the percentage of `True` values over all traces.
{% /alert %}

#### Displaying tool usage stats{% #displaying-tool-usage-stats %}

If your agent is supposed to call a certain tool, you can build a widget to understand how often the tool is called and to gather statistics about it:

{% image
   source="https://docs.dd-static.net/images/llm_observability/experiments/widget-tool.022b6df0bddf74666082cab99368bb55.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/llm_observability/experiments/widget-tool.022b6df0bddf74666082cab99368bb55.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="Widget using LLM Experiments data. Graph showing some usage statistics of a tool in multiple experiments." /%}
