Datasets

In Agent Observability Experiments, a dataset is a collection of inputs, expected outputs, and metadata that represent the scenarios you want to test your agent on. Each dataset is associated with a project.

Each record in a dataset contains:

  • input (required): Represents all the information that the agent can access in a task.
  • expected output (optional): Also called ground truth, represents the ideal answer that the agent should output. You can use expected output to store the actual output of the app, as well as any intermediary results you want to assess.
  • metadata (optional): Contains any useful information to categorize the record and use for further analysis. For example: topics, tags, descriptions, notes.
  • id (optional): A user-defined identifier for the record. Must be 128 characters or fewer and contain only letters, numbers, underscores (_), hyphens (-), or periods (.). If not provided, the SDK generates one automatically.
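
The id constraints above can be checked locally before records are uploaded. The following is a minimal sketch, not part of the SDK: the helper name is_valid_record_id is hypothetical, and it assumes that "letters" means ASCII letters and that an empty ID is rejected.

```python
import re

# Hypothetical helper, not part of the SDK.
# Assumptions: "letters" means ASCII letters, and an empty ID is rejected.
_RECORD_ID_PATTERN = re.compile(r"[A-Za-z0-9_.-]{1,128}")


def is_valid_record_id(record_id: str) -> bool:
    """Return True if record_id satisfies the documented length and character rules."""
    return _RECORD_ID_PATTERN.fullmatch(record_id) is not None


print(is_valid_record_id("japan-capital"))  # True
print(is_valid_record_id("japan capital"))  # False: contains a space
print(is_valid_record_id("a" * 129))        # False: longer than 128 characters
```

Validating IDs up front makes a bad identifier fail in your own code rather than during upload.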

Datasets enable systematic testing and regression detection by providing consistent evaluation scenarios across experiments.

Creating a dataset

You can create datasets from production data, from CSV files, or by constructing them manually in code.

To create a dataset from a CSV file, use LLMObs.create_dataset_from_csv():

from ddtrace.llmobs import LLMObs

# Create dataset from CSV
dataset = LLMObs.create_dataset_from_csv(
    csv_path="questions.csv",
    dataset_name="capitals-of-the-world",
    project_name="capitals-project",              # Optional: defaults to the project name from LLMObs.enable
    description="Geography quiz dataset",         # Optional: Dataset description
    input_data_columns=["question", "category"],  # Columns to use as input
    expected_output_columns=["answer"],           # Optional: Columns to use as expected output
    metadata_columns=["difficulty"],              # Optional: Additional columns as metadata
    id_column="record_id",                        # Optional: Column to use as record IDs
    csv_delimiter=","                             # Optional: Defaults to comma
)

# Example "questions.csv":
# record_id,question,category,answer,difficulty
# japan-capital,What is the capital of Japan?,geography,Tokyo,medium
# brazil-capital,What is the capital of Brazil?,geography,Brasília,medium

Notes:

  • CSV files must have a header row
  • Maximum field size is 10MB
  • All columns not specified in input_data_columns, expected_output_columns, or id_column are automatically treated as metadata
  • The dataset is automatically pushed to Datadog after creation
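
The column mapping described in the notes can be pictured with the standard csv module. This is only an illustration of the rule that unclaimed columns become metadata, not the SDK's implementation: the helper split_row is hypothetical, and the exact record shape the SDK produces (for example, whether expected output is a dict or a string) may differ.

```python
import csv
import io

CSV_TEXT = """record_id,question,category,answer,difficulty
japan-capital,What is the capital of Japan?,geography,Tokyo,medium
brazil-capital,What is the capital of Brazil?,geography,Brasília,medium
"""


def split_row(row, input_columns, expected_columns, id_column):
    """Split one CSV row (a dict keyed by header name) into record fields."""
    claimed = set(input_columns) | set(expected_columns) | {id_column}
    return {
        "id": row[id_column],
        "input_data": {c: row[c] for c in input_columns},
        "expected_output": {c: row[c] for c in expected_columns},
        # Any column not claimed above falls through to metadata.
        "metadata": {c: v for c, v in row.items() if c not in claimed},
    }


reader = csv.DictReader(io.StringIO(CSV_TEXT))  # the first row is the header
records = [
    split_row(row, ["question", "category"], ["answer"], "record_id")
    for row in reader
]
print(records[0])
```

Here the difficulty column is not listed anywhere, so it ends up under metadata.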

To manually create a dataset, use LLMObs.create_dataset():

from ddtrace.llmobs import LLMObs

dataset = LLMObs.create_dataset(
    dataset_name="capitals-of-the-world",
    project_name="capitals-project", # optional, defaults to project_name used in LLMObs.enable
    description="Questions about world capitals",
    records=[
        {
            "id": "china-capital",                                             # optional, user-defined record ID
            "input_data": {"question": "What is the capital of China?"},       # required, JSON or string
            "expected_output": "Beijing",                                      # optional, JSON or string
            "metadata": {"difficulty": "easy"}                                 # optional, JSON
        },
        {
            "input_data": {"question": "Which city serves as the capital of South Africa?"},
            "expected_output": "Pretoria",
            "metadata": {"difficulty": "medium"}
        }
    ]
)
# View dataset in Datadog UI
print(f"View dataset: {dataset.url}")

To create a dataset from production data, add production traces to a dataset manually through the UI or automatically with Automations.

Manual selection (UI):

  1. Navigate to AI Observability > Traces. You can also add a new Automation from Settings > Automations.
  2. Find a trace you want to include in a dataset.
  3. Click Add to Dataset.
  4. Choose an existing dataset or create a dataset.
  5. The trace’s input, output, and metadata are automatically extracted.

Automatic routing (Automations):

Automations apply going forward: new traces matching your rule are routed to the dataset as they arrive. Existing traces matching the filter are not added retroactively.

Automations enable you to continuously route production traces to datasets based on configurable rules, keeping your datasets current with production behavior without manual intervention.

To set up automatic dataset updates:

  1. Navigate to AI Observability > Traces.
  2. Apply filters to identify traces you want to route (evaluation failures, latency thresholds, specific applications). See Automation Rules > Supported filter fields for what’s allowed.
  3. Click Automate Query.
  4. Configure sampling rate (for example, 10% of matching traces).
  5. Select Add to Dataset as the action.
  6. Choose an existing dataset or create a dataset.

After creating an automation, manage it from AI Observability > Settings > Automations:

  • Enable/disable: Control whether new traces are added to the dataset.
  • Edit: Modify filters, sampling rates, or target datasets as your needs change.
  • Delete: Remove automations that are no longer needed.

Dataset limits:

  • Datasets populated by automations are capped at 20,000 records.
  • These datasets are read-only to prevent accidental modification of automated data.
  • To modify records, clone the dataset first.

Example use cases for Automations:

  • Sample 10% of traces with failed evaluations to build a failure dataset.
  • Collect edge cases where latency exceeds thresholds.
  • Maintain a diverse dataset with stratified sampling across user segments.
  • Automatically capture new failure patterns as they emerge in production.

Retrieving a dataset

To retrieve a project’s existing dataset from Datadog:

dataset = LLMObs.pull_dataset(
    dataset_name="capitals-of-the-world",
    project_name="capitals-project", # optional, defaults to the project name from LLMObs.enable
    version=1 # optional, defaults to the latest version
)

# Get dataset length
print(len(dataset))

Exporting a dataset to pandas

The Dataset class also provides the as_dataframe() method, which converts a dataset to a pandas DataFrame.

pandas is required for this operation. To install it, run pip install pandas.

# Convert dataset to pandas DataFrame
df = dataset.as_dataframe()
print(df.head())

# DataFrame output with MultiIndex columns:
#                                   input_data     expected_output  metadata
#    question                       category       answer           difficulty
# 0  What is the capital of Japan?  geography      Tokyo            medium
# 1  What is the capital of Brazil? geography      Brasília         medium

The DataFrame has a MultiIndex structure with the following columns:

  • input_data: Contains all input fields from input_data_columns
  • expected_output: Contains all output fields from expected_output_columns
  • metadata: Contains any additional fields from metadata_columns
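
Working with MultiIndex columns can be unfamiliar, so the sketch below shows two common selections. It builds a stand-in DataFrame by hand that mirrors the layout shown above instead of calling as_dataframe(), so it runs without a Datadog connection. The real DataFrame may differ in details.

```python
import pandas as pd

# Hand-built stand-in that mirrors the column layout shown above.
columns = pd.MultiIndex.from_tuples([
    ("input_data", "question"),
    ("input_data", "category"),
    ("expected_output", "answer"),
    ("metadata", "difficulty"),
])
df = pd.DataFrame(
    [
        ["What is the capital of Japan?", "geography", "Tokyo", "medium"],
        ["What is the capital of Brazil?", "geography", "Brasília", "medium"],
    ],
    columns=columns,
)

# Select one leaf column with a (top level, field) tuple.
questions = df[("input_data", "question")]

# Select every field under one top-level group; the result has plain column names.
inputs = df["input_data"]

print(questions.tolist())
print(list(inputs.columns))
```

Selecting by the top-level name is convenient when you want to pass only the inputs, or only the expected outputs, to another function.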

Dataset versioning

Datasets are automatically versioned to track changes over time. Versioning information enables reproducibility and allows experiments to reference specific dataset versions.

The Dataset object has a field, current_version, which corresponds to the latest version; previous versions are subject to a 90-day retention window.

Dataset version numbers start at 0, and each new version increments the number by 1.

When new dataset versions are created

A new dataset version is created when:

  • Adding records
  • Updating records (changes to input, expected_output, or metadata fields)
  • Deleting records

Dataset versions are NOT created when updating the dataset name or description.
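
The versioning rules can be summarized as a small model. This is a sketch for reasoning about version numbers, not SDK code: the VersionTracker class and the change names are hypothetical, and it assumes each record-changing operation saved to Datadog produces exactly one new version.

```python
from dataclasses import dataclass

# Hypothetical change names; only record-level changes create a new version.
RECORD_CHANGES = {"add_records", "update_records", "delete_records"}


@dataclass
class VersionTracker:
    """Toy model of the documented versioning rules."""

    current_version: int = 0  # versions start at 0

    def apply(self, change: str) -> int:
        """Apply one saved change and return the resulting version number."""
        if change in RECORD_CHANGES:
            self.current_version += 1
        return self.current_version


tracker = VersionTracker()
print(tracker.apply("add_records"))         # 1
print(tracker.apply("update_description"))  # 1: name and description edits do not version
print(tracker.apply("delete_records"))      # 2
```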

Version retention

  • The active version of a Dataset is retained for 3 years.
  • Previous versions (NOT the content of current_version) are retained for 90 days.
  • The 90-day retention period resets when a previous version is used — for example, when an experiment reads a version.
  • After 90 consecutive days without use, a previous version is eligible for permanent deletion and may no longer be accessible.

Example of version retention behavior

After you publish version 12, version 11 becomes a previous version with a 90-day retention window. After 25 days, you run an experiment with version 11, which restarts the 90-day window. If you do not use version 11 during the following 90 days, it may be deleted.
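
The same timeline can be worked out with date arithmetic. The dates and the deletion_eligible_on helper below are hypothetical, and the sketch assumes the window is counted in whole days from the most recent use. The exact boundary behavior is not specified here.

```python
from datetime import date, timedelta
from typing import Optional

RETENTION = timedelta(days=90)


def deletion_eligible_on(became_previous_on: date, last_used_on: Optional[date] = None) -> date:
    """Return the first date a previous version may be deleted.

    The 90-day window starts when the version is superseded and restarts on each use.
    """
    window_start = last_used_on or became_previous_on
    return window_start + RETENTION


superseded = date(2025, 1, 1)                 # hypothetical: version 12 is published
used_again = superseded + timedelta(days=25)  # an experiment reads version 11

print(deletion_eligible_on(superseded))              # 2025-04-01
print(deletion_eligible_on(superseded, used_again))  # 2025-04-26
```

In total, version 11 stays available for at least 115 days after it was superseded: 25 days before the experiment plus a fresh 90-day window.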

Accessing and managing dataset records

You can access dataset records using standard Python indexing:

# Get a single record
record = dataset[0]

# Get multiple records
records = dataset[1:3]

# Iterate through records
for record in dataset:
    print(record["input_data"])

The Dataset class provides methods to manage records: append(), update(), and delete(). These methods change only the local Dataset object; call push() to save the changes to Datadog.

# Add a new record
dataset.append({
    "id": "switzerland-capital",
    "input_data": {"question": "What is the capital of Switzerland?"},
    "expected_output": "Bern",
    "metadata": {"difficulty": "easy"}
})

# Update an existing record
dataset.update(0, {
    "input_data": {"question": "What is the capital of China?"},
    "expected_output": "Beijing",
    "metadata": {"difficulty": "medium"}
})

# Delete a record
dataset.delete(1)  # Deletes the second record

# Save changes to Datadog
dataset.push()

Customizing the dataset table

When viewing a dataset’s records, you can customize the table to quickly scan and compare records without expanding each one individually.

Column picker

Use the column picker to toggle columns on or off and drag to reorder them.

Custom columns

Extract specific fields from your dataset records and display them as dedicated table columns. To add a custom column, type a field path in the Add Column input at the top of the table. You can add multiple custom columns and reorder them with drag-and-drop. Column configuration is saved to your browser’s local storage per project.
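
To illustrate what a field path resolves to, the sketch below walks a record by path. The path syntax accepted by the Add Column input is not specified on this page, so the dot-separated form used here is an assumption, and the extract_field helper is hypothetical.

```python
from typing import Any


def extract_field(record: dict, path: str, default: Any = None) -> Any:
    """Follow a dot-separated path (assumed syntax) through nested dicts."""
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default  # path does not exist in this record
        current = current[key]
    return current


record = {
    "input_data": {"question": "What is the capital of Switzerland?"},
    "expected_output": "Bern",
    "metadata": {"difficulty": "easy"},
}

print(extract_field(record, "input_data.question"))  # What is the capital of Switzerland?
print(extract_field(record, "metadata.difficulty"))  # easy
print(extract_field(record, "metadata.topic"))       # None
```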