---
title: Annotation Queues
description: >-
  Enable systematic human review of LLM traces to identify failure modes,
  validate automated evaluations, and build golden datasets.
---

# Annotation Queues

{% callout %}
# Important note for users on the following Datadog sites: app.ddog-gov.com, us2.ddog-gov.com

{% alert level="danger" %}
This product is not supported for your selected [Datadog site](https://docs.datadoghq.com/getting_started/site.md) ({% placeholder "user-datadog-site-name" /%}).
{% /alert %}

{% /callout %}

## Overview{% #overview %}

Annotation Queues provide a structured workflow for human review of LLM traces. Use annotation queues to:

- Review traces with complete context including spans, metadata, tool calls, inputs, outputs, and evaluation results
- Apply structured labels and free-form observations to traces
- Identify and categorize failure patterns
- Validate LLM-as-a-Judge evaluation accuracy
- Build golden datasets with human-verified labels for testing and validation

## Creating an annotation queue{% #creating-an-annotation-queue %}

### Step 1: Configure queue settings{% #step-1-configure-queue-settings %}

1. Navigate to [AI Observability > Experiment > Annotations](https://app.datadoghq.com/llm/annotations/queues) and select your project.

1. Click Create Queue.

1. On the About tab, configure:

   - Name: Descriptive name reflecting the queue's purpose (for example, "Failed Evaluations Review - Q1 2026")
   - Project: Agent Observability project this queue belongs to
   - Description (optional): Explain the queue's purpose and any special instructions for annotators

1. Click Next.

1. On the Schema tab, define your new queue's label schema. Use the Preview pane to see how labels appear to annotators as you configure them. Each label can be marked as required and can optionally include:

   - Assessment criteria: Allow annotators to indicate pass/fail for that label value
   - Reasoning: Allow annotators to add a short explanation

1. Review your queue configuration and click Create to create the queue.

   {% image
      source="https://docs.dd-static.net/images/llm_observability/evaluations/annotation_queues/schema_edit.a609351cbdde06a053e7e813868f061f.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/llm_observability/evaluations/annotation_queues/schema_edit.a609351cbdde06a053e7e813868f061f.png?auto=format&fit=max&w=850&dpr=2 2x"
      alt="The Edit Queue modal showing the Schema tab with label configuration on the left and a preview pane on the right. The left panel displays fields for configuring a categorical label named failure_type with three categories: hallucination, formatting_error, and refusal. Checkboxes enable Assessment Criteria and Reasoning options. The right preview pane shows how the label appears to annotators with checkboxes for each category, Pass/Fail assessment buttons, and a reasoning text field." /%}

### Step 2: Select traces for annotation{% #step-2-select-traces-for-annotation %}

You can add traces to a queue manually from the Trace Explorer, or populate queues automatically using Automation Rules.

{% tab title="Manually from Trace Explorer" %}
Add traces to a queue manually from the Trace Explorer:

1. Navigate to [AI Observability > Traces](https://app.datadoghq.com/llm/traces).
1. Filter traces using available facets (evaluation results, error status, application, time range).
1. Select individual traces or bulk select multiple traces.
1. Click Flag for Annotation.
1. Choose Create New Queue or select an existing queue.

{% /tab %}

{% tab title="Using Automation Rules" %}
Instead of manually selecting traces, use Automation Rules to route traces into annotation queues automatically based on filters and sampling criteria. This enables continuous, hands-off queue population without requiring manual trace selection. See [Automation Rules](https://docs.datadoghq.com/llm_observability/monitoring/automation_rules.md) for the full feature reference, including supported filter fields and limits.

{% alert level="info" %}
Automations apply going forward: new traces matching your rule are routed to the queue as they arrive. Existing traces matching the filter are not added retroactively.
{% /alert %}

To add an annotation queue action to an Automation Rule:

1. Navigate to [AI Observability > Traces](https://app.datadoghq.com/llm/traces)
1. Apply filters to identify traces you want to route (evaluation failures, latency thresholds, specific applications). See [Automation Rules > Supported filter fields](https://docs.datadoghq.com/llm_observability/monitoring/automation_rules.md#supported-filter-fields) for what's allowed.
1. Click Automate Query
1. Configure the sampling rate (up to 5% for annotation queues; for example, 2% of matching traces).
1. Under Actions, select Add to Annotation Queue.
1. Choose the target queue.
1. Save the rule.

Traces matching the rule's filters are added to the queue as they arrive. Annotation queues hold up to 1,000 records; the automation pauses when the queue reaches that limit.
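
To pick a sampling rate, it can help to estimate how quickly a rule fills a queue. The following is a rough sketch, not a Datadog API: the function name and its parameters are hypothetical, and it assumes a steady daily volume of matching traces. It uses only the 5% sampling ceiling and the 1,000-record limit stated above.

```python
MAX_QUEUE_RECORDS = 1000   # queue capacity stated in this section
MAX_SAMPLING_PERCENT = 5   # sampling ceiling for annotation queue actions


def days_until_queue_full(matching_traces_per_day, sampling_percent, current_records=0):
    """Estimate how many days pass before the queue reaches its record limit."""
    if not 0 < sampling_percent <= MAX_SAMPLING_PERCENT:
        raise ValueError("sampling_percent must be greater than 0 and at most 5")
    if matching_traces_per_day <= 0:
        raise ValueError("matching_traces_per_day must be positive")
    remaining = max(MAX_QUEUE_RECORDS - current_records, 0)
    added_per_day = matching_traces_per_day * sampling_percent / 100
    return remaining / added_per_day


# 10,000 matching traces per day sampled at 2% adds about 200 records per day.
print(days_until_queue_full(10_000, 2))  # 5.0
```

If the estimate is shorter than your review cadence, consider a lower sampling rate or a narrower filter so the automation does not pause before annotators catch up.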
{% /tab %}

## Annotating traces{% #annotating-traces %}

### Accessing your queues{% #accessing-your-queues %}

Navigate to [AI Observability > Experiment > Annotations](https://app.datadoghq.com/llm/annotations/queues) to see all available annotation queues. Click on a queue to see the trace list, then click Review to begin annotating.

Review Mode displays:

- Full trace context (right panel):

  - Complete span tree with inputs, outputs, metadata
  - Tool calls and intermediate reasoning steps
  - Evaluation results on trace and individual spans

- Annotation controls (left panel):

  - Configured labels for this queue
  - Progress indicator showing position in queue
  - Navigation controls (Previous, Next)

  {% image
     source="https://docs.dd-static.net/images/llm_observability/evaluations/annotation_queues/review.5d52d7b14a54e6ba64e36cefaf677a30.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/llm_observability/evaluations/annotation_queues/review.5d52d7b14a54e6ba64e36cefaf677a30.png?auto=format&fit=max&w=850&dpr=2 2x"
     alt="The annotation review interface showing the annotation panel on the left and trace details on the right. The left panel displays label controls including failure_type checkboxes for hallucination, formatting_error, and refusal, plus a requires_escalation assessment with Pass and Fail buttons and a Save button at the bottom. The right panel shows the trace details for citizen_agent with a span tree, evaluation results, and expandable sections for Input and Output displaying JSON-formatted data about a weather information query." /%}

### Applying labels{% #applying-labels %}

For each trace:

1. **Review the full trace context**: Expand spans as needed to understand inputs, outputs, tool calls, and evaluation results.
1. **Apply labels**: Fill in the configured labels based on your assessment.
1. Annotations are autosaved.

### Best practices for annotation{% #best-practices-for-annotation %}

**Be consistent**:

- Review the queue description and label definitions before starting.
- When multiple annotators work on the same queue, establish a shared understanding of the criteria.
- Document reasoning in notes for borderline cases.

**Provide reasoning**:

- Use free-form notes to document why you applied specific labels.
- Note patterns you observe across multiple traces.
- Reasoning helps refine evaluation criteria and understand failure modes.

## Managing queues{% #managing-queues %}

### Tracking queue progress{% #tracking-queue-progress %}

The Annotations list page displays a progress bar for each queue showing the ratio of reviewed interactions to total interactions. Use this to monitor annotation completion across queues at a glance.

### Filtering traces by annotation labels{% #filtering-traces-by-annotation-labels %}

Use the Annotation Labels facet to filter traces by labels applied in annotation queues. This allows you to:

- Find all traces tagged with a specific failure mode (for example, `failure_type: hallucination`)
- Build targeted samples for downstream review, dataset creation, or CSV export for data analysis

### Editing queue schema{% #editing-queue-schema %}

You can modify a queue's label schema after creation:

1. Navigate to [AI Observability > Experiment > Annotations](https://app.datadoghq.com/llm/annotations/queues).
1. Open the queue.
1. If the Details panel is hidden, click View Details.
1. Click Edit.
1. Add, remove, or modify labels.
1. Click Save Changes.

{% alert level="info" %}
Changing the schema doesn't affect already-applied labels, but annotators will see the updated schema going forward.
{% /alert %}

### Exporting annotated data{% #exporting-annotated-data %}

Export annotated traces for analysis or use in other workflows:

1. Navigate to [AI Observability > Experiment > Annotations](https://app.datadoghq.com/llm/annotations/queues).
1. Open the queue.
1. Select traces (or select all).
1. Click Export.

The file downloads as `annotations_<queue-id>.csv`. You can also retrieve span data programmatically using the [Export API](https://docs.datadoghq.com/llm_observability/evaluations/export_api.md?tab=model#api-standards).

{% collapsible-section #csv-format %}
#### CSV format

Each row represents one annotated interaction. The file begins with these fixed columns:

| Column            | Description                                                                                |
| ----------------- | ------------------------------------------------------------------------------------------ |
| `Content ID`      | ID of the annotated content (for example, a trace ID or session ID)                        |
| `Type`            | Interaction type: `trace`, `experiment_trace`, or `session`                                |
| `Input`           | Input summary (empty for session interactions)                                             |
| `Output`          | Output summary (empty for session interactions)                                            |
| `Expected Output` | Only present when Include Expected Output is enabled; populated for experiment traces only |

After the fixed columns, there is one set of columns per reviewer per label. Reviewers are sorted alphabetically by display name (spaces replaced with underscores). Labels follow the order defined in the queue schema:

| Column                          | Description                                                        |
| ------------------------------- | ------------------------------------------------------------------ |
| `{reviewer}_{label}`            | Label value (string, number, boolean, or JSON array)               |
| `{reviewer}_{label}_assessment` | `pass` or `fail`, if assessment criteria is enabled for that label |
| `{reviewer}_{label}_reasoning`  | Free-text reasoning, if reasoning is enabled for that label        |

If a reviewer has not annotated a given row, those cells are empty.

**Example**: A queue with reviewers Alice Johnson and Bob Smith and labels `quality` (score) and `failure_type` (categorical) produces these column headers:

```
Content ID,Type,Input,Output,Alice_Johnson_quality,Alice_Johnson_quality_assessment,Alice_Johnson_quality_reasoning,Alice_Johnson_failure_type,Alice_Johnson_failure_type_assessment,Alice_Johnson_failure_type_reasoning,Bob_Smith_quality,...
```

{% /collapsible-section %}
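
The wide, per-reviewer column layout can be awkward to analyze directly. The sketch below shows one way to reshape an export into one record per reviewer and label, based only on the column naming described in the CSV format section. The helper names (`split_column`, `to_long_format`) are hypothetical, and the parser assumes you pass in the queue's label names, because both reviewer names and label names can contain underscores. A label whose own name ends in `_assessment` or `_reasoning` can still be ambiguous.

```python
import csv
import io

FIXED_COLUMNS = {"Content ID", "Type", "Input", "Output", "Expected Output"}
SUFFIXES = (("_assessment", "assessment"), ("_reasoning", "reasoning"), ("", "value"))


def split_column(header, labels):
    """Return (reviewer, label, field) for a per-reviewer column, else None."""
    for suffix, field in SUFFIXES:
        if not header.endswith(suffix):
            continue
        stem = header[: len(header) - len(suffix)]
        # Try longer label names first so "failure_type" wins over "type".
        for label in sorted(labels, key=len, reverse=True):
            marker = "_" + label
            if stem.endswith(marker) and len(stem) > len(marker):
                return stem[: -len(marker)], label, field
    return None


def to_long_format(csv_text, labels):
    """Turn the exported CSV text into one dict per (row, reviewer, label)."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        per_key = {}
        for header, cell in row.items():
            # Skip fixed columns, malformed extra cells, and empty cells
            # (a reviewer who has not annotated this row).
            if header is None or header in FIXED_COLUMNS or not cell:
                continue
            parsed = split_column(header, labels)
            if parsed is None:
                continue
            reviewer, label, field = parsed
            entry = per_key.setdefault(
                (reviewer, label),
                {"content_id": row["Content ID"], "reviewer": reviewer, "label": label},
            )
            entry[field] = cell
        records.extend(per_key.values())
    return records
```

For example, `split_column("Alice_Johnson_quality_assessment", ["quality", "failure_type"])` yields the reviewer `Alice_Johnson`, the label `quality`, and the field `assessment`. All cell values stay as strings, so convert scores or JSON arrays yourself as needed.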

#### Retrieve spans by trace ID or session ID{% #retrieve-spans-by-trace-id-or-session-id %}

After exporting annotation data, use the [Export API](https://docs.datadoghq.com/llm_observability/evaluations/export_api.md?tab=model#api-standards) to retrieve the full span data for traces or sessions in the CSV and join it with your annotation labels.

**By trace ID**:

```bash
curl -G "https://api.datadoghq.com/api/v2/llm-obs/v1/spans/events" \
  -H "DD-API-KEY: <YOUR_DATADOG_API_KEY>" \
  -H "DD-APPLICATION-KEY: <YOUR_DATADOG_APPLICATION_KEY>" \
  --data-urlencode "filter[trace_id]=<TRACE_ID>"
```

**By session ID**:

```bash
curl -G "https://api.datadoghq.com/api/v2/llm-obs/v1/spans/events" \
  -H "DD-API-KEY: <YOUR_DATADOG_API_KEY>" \
  -H "DD-APPLICATION-KEY: <YOUR_DATADOG_APPLICATION_KEY>" \
  --data-urlencode "filter[query]=@session_id:<SESSION_ID>"
```

### Adding to datasets{% #adding-to-datasets %}

Transfer annotated traces to datasets for experiment evaluation:

1. Navigate to [AI Observability > Experiment > Annotations](https://app.datadoghq.com/llm/annotations/queues).
1. Open the queue.
1. Select traces to transfer.
1. Click Add to Dataset.
1. Set the dataset's expected output:
   - From interaction: use each trace's actual output. For experiment traces, you can also pick Expected output to use the original expected output from the experiment's source dataset.
   - From annotation label: use the values the annotators applied. Pick one or more labels. The record's `expected_output` is built from your selection.
1. Choose an existing dataset, or create a dataset.

When **expected output** is built from annotation labels, the exported value is a JSON object keyed by label name, for example `{ "is_harmful": false, "tone": ["neutral"], "topics": ["safety", "policy"] }`. The same shape applies whether you select one label or multiple labels. Categorical labels are always exported as arrays of selected options, whether the label is single-select or multi-select.

{% collapsible-section #annotation-aggregation %}
#### How annotation values are aggregated across annotators

When multiple annotators have annotated the same trace, the value for each label is aggregated across them by consensus:

| Label type  | Aggregation                                                           |
| ----------- | --------------------------------------------------------------------- |
| Boolean     | Majority vote (ties break in favor of `true`)                         |
| Categorical | Intersection: the sorted set of options that every annotator selected |
| Score       | Average                                                               |
| Text        | List of responses                                                     |

For categorical labels (single-select or multi-select), the aggregated value is the sorted array of options that *every* annotator selected. Options chosen by only some annotators are dropped, and if no option is common to all annotators, the value is an empty array. The result is always an array, even when only one annotator has annotated the trace.

**Example: categorical (consensus).** Three annotators rate `tone` and all agree:

- Annotator A: `polite`
- Annotator B: `polite`
- Annotator C: `polite`

Aggregated: `["polite"]`.

**Example: categorical (disagreement).** Three annotators rate `tone` and one differs:

- Annotator A: `polite`
- Annotator B: `rude`
- Annotator C: `polite`

Aggregated: `[]`. The intersection is empty because `rude` is not in every annotator's set.

**Example: categorical (multi-select).** Three annotators tag `topics` (each can pick multiple options):

- Annotator A: `["safety", "policy"]`
- Annotator B: `["safety", "billing"]`
- Annotator C: `["safety", "policy"]`

Aggregated: `["safety"]`. Only `safety` appears in every annotator's set; `policy` is missing from B's selection and `billing` is missing from A's and C's.

**Example: text.** Two annotators leave notes:

- Annotator A: `"Confusing phrasing"`
- Annotator B: `"Tone too casual"`

Aggregated: `["Confusing phrasing", "Tone too casual"]`. Every annotator's value is preserved.

Raw per-annotator values are preserved in each record's metadata, along with annotator identity. If the default consensus doesn't fit your workflow, you can recompute with a different strategy (for example, median, weighted vote, or reviewer pick).
{% /collapsible-section %}
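
The consensus rules described above can be reproduced locally, which is useful as a baseline when you recompute values with your own strategy. This is a minimal sketch of those rules, not Datadog's implementation: the function name and the label type strings (`boolean`, `categorical`, `score`, `text`) are illustrative assumptions.

```python
def aggregate_label(label_type, values):
    """Aggregate one label's per-annotator values using the default consensus rules."""
    if not values:
        raise ValueError("at least one annotator value is required")
    if label_type == "boolean":
        # Majority vote; a tie counts as True.
        true_votes = sum(1 for value in values if value)
        return true_votes * 2 >= len(values)
    if label_type == "categorical":
        # Treat a single-select value as a one-element set, then intersect.
        selections = [{value} if isinstance(value, str) else set(value) for value in values]
        return sorted(set.intersection(*selections))
    if label_type == "score":
        return sum(values) / len(values)
    if label_type == "text":
        return list(values)
    raise ValueError(f"unknown label type: {label_type}")


expected_output = {
    "is_harmful": aggregate_label("boolean", [False, False, True]),
    "tone": aggregate_label("categorical", ["polite", "polite", "polite"]),
    "topics": aggregate_label(
        "categorical",
        [["safety", "policy"], ["safety", "billing"], ["safety", "policy"]],
    ),
}
print(expected_output)
# {'is_harmful': False, 'tone': ['polite'], 'topics': ['safety']}
```

To apply a different strategy, such as a median for scores, swap the relevant branch and run it over the raw per-annotator values.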

Labels not selected as expected output are also included with each trace as metadata.

See [Datasets](https://docs.datadoghq.com/llm_observability/experiments/datasets.md) for more information about using datasets in experiments.

### Deleting queues{% #deleting-queues %}

To delete a queue:

1. Navigate to [AI Observability > Experiment > Annotations](https://app.datadoghq.com/llm/annotations/queues).
1. Open the queue.
1. Click Delete in the Details panel.

{% alert level="info" %}
Deleting a queue removes the queue and label associations, but does not delete the underlying traces from Agent Observability. Traces remain accessible in Trace Explorer.
{% /alert %}

## Using the API{% #using-the-api %}

You can manage annotation queues programmatically. The following endpoints are available in the [Agent Observability API reference](https://docs.datadoghq.com/api/latest/llm-observability.md):

| Endpoint                                                                                                                              | Description                                                                                                                                  |
| ------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| [List annotation queues](https://docs.datadoghq.com/api/latest/llm-observability.md#list-llm-observability-annotation-queues)         | List all annotation queues in your organization.                                                                                             |
| [Create an annotation queue](https://docs.datadoghq.com/api/latest/llm-observability.md#create-an-llm-observability-annotation-queue) | Create an annotation queue. `name` and `project_id` are required. Include an optional `annotation_schema` to define labels at creation time. |
| [Update an annotation queue](https://docs.datadoghq.com/api/latest/llm-observability.md#update-an-llm-observability-annotation-queue) | Partially update a queue's `name`, `description`, or `annotation_schema`.                                                                    |
| [Delete an annotation queue](https://docs.datadoghq.com/api/latest/llm-observability.md#delete-an-llm-observability-annotation-queue) | Delete an annotation queue by ID.                                                                                                            |
| [Add interactions to a queue](https://docs.datadoghq.com/api/latest/llm-observability.md#add-annotation-queue-interactions)           | Add one or more traces to an annotation queue for review.                                                                                    |
| [Delete interactions from a queue](https://docs.datadoghq.com/api/latest/llm-observability.md#delete-annotation-queue-interactions)   | Remove specific interactions from a queue by interaction ID.                                                                                 |
| [Get annotated interactions](https://docs.datadoghq.com/api/latest/llm-observability.md#get-annotated-queue-interactions)             | Retrieve all interactions and their applied annotation labels for a queue.                                                                   |
| [Get label schema](https://docs.datadoghq.com/api/latest/llm-observability.md#get-annotation-queue-label-schema)                      | Retrieve the label schema configured for a queue.                                                                                            |
| [Update label schema](https://docs.datadoghq.com/api/latest/llm-observability.md#update-annotation-queue-label-schema)                | Create or replace the label schema for a queue.                                                                                              |

## Data retention{% #data-retention %}

| Data              | Retention period                                     |
| ----------------- | ---------------------------------------------------- |
| Traces in queues  | Capped by your organization's trace retention period |
| Annotation labels | Indefinite                                           |

## Example workflows{% #example-workflows %}

{% collapsible-section
   open=null
   #example-error-analysis-and-failure-mode-discovery %}
### Error analysis and failure mode discovery

Review failed traces to identify recurring patterns and categorize how your application fails in production.

1. Filter traces in Trace Explorer for failed evaluations or specific error patterns
1. Manually select traces and add to an annotation queue
1. Annotators review traces and document failure types in free-form notes
1. Common patterns emerge: hallucinations in specific contexts, formatting issues, inappropriate refusals
1. Create categorical labels for identified failure modes and re-code traces
1. Use failure mode distribution to prioritize fixes

#### Queue configuration{% #queue-configuration %}

- **Labels**: Free-form notes, categorical `failure_type` label, pass/fail rating

- **Annotators**: Product managers, engineers, domain experts
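
The last step of this workflow, using the failure mode distribution to prioritize fixes, can be sketched as a simple count. The example assumes you have already collected the `failure_type` selections for each annotated trace (for example, from an export); the function name and input shape are hypothetical.

```python
from collections import Counter


def failure_mode_distribution(annotations):
    """Return (failure_type, trace_count, share_of_traces), most common first.

    `annotations` holds one list of selected failure types per annotated trace.
    """
    total = len(annotations)
    if total == 0:
        return []
    counts = Counter()
    for selected in annotations:
        counts.update(set(selected))  # count each failure type once per trace
    return [(mode, count, count / total) for mode, count in counts.most_common()]


annotations = [
    ["hallucination"],
    ["hallucination", "formatting_error"],
    ["refusal"],
    ["hallucination"],
]
for mode, count, share in failure_mode_distribution(annotations):
    print(f"{mode}: {count} traces ({share:.0%})")
```

Because a trace can carry more than one failure type, the shares can add up to more than 100%.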

{% /collapsible-section %}

{% collapsible-section
   open=null
   #example-validating-llm-as-a-judge-evaluations %}
### Validating LLM-as-a-Judge evaluations

Find traces where automated evaluators may be uncertain or incorrect, then have humans provide ground truth.

1. Sample evaluation results: all results, or a given score/threshold
1. Add selected traces to an annotation queue
1. Annotators review traces and provide human scores for the same criteria
1. Compare human labels to automated evaluation scores
1. Identify systematic disagreements (judge too strict, too lenient, or misunderstanding criteria)
1. Refine evaluation prompts based on disagreements

#### Queue configuration{% #queue-configuration-1 %}

- **Labels**: Numeric scores matching evaluation criteria (0-10), categorical `judge_accuracy` label, reasoning notes

- **Annotators**: Subject matter experts who understand evaluation criteria
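
Comparing human labels to automated scores can start with a simple agreement summary. The sketch below assumes each trace has been reduced to a pair of pass/fail booleans, one from the human annotator and one from the judge. If your labels are 0-10 scores, apply a threshold of your choosing first. The function name and the result keys are hypothetical.

```python
def compare_judge_to_humans(pairs):
    """Summarize agreement for (human_passed, judge_passed) pairs, one per trace."""
    summary = {"agree": 0, "judge_too_strict": 0, "judge_too_lenient": 0}
    for human_passed, judge_passed in pairs:
        if human_passed == judge_passed:
            summary["agree"] += 1
        elif human_passed:
            summary["judge_too_strict"] += 1   # human passed, judge failed
        else:
            summary["judge_too_lenient"] += 1  # human failed, judge passed
    total = sum(summary.values())
    summary["agreement_rate"] = summary["agree"] / total if total else None
    return summary


pairs = [(True, True), (True, False), (False, True), (False, False), (True, True)]
print(compare_judge_to_humans(pairs))
# {'agree': 3, 'judge_too_strict': 1, 'judge_too_lenient': 1, 'agreement_rate': 0.6}
```

A skew toward one of the two disagreement counts suggests which direction to adjust the evaluation prompt.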

{% /collapsible-section %}

{% collapsible-section open=null #example-golden-dataset-creation %}
### Golden dataset creation

Build benchmark datasets with human-verified labels for regression testing and continuous validation.

1. Sample diverse production traces from Trace Explorer (both good and bad examples)
1. Add traces to annotation queue
1. Annotators review and label traces across multiple quality dimensions
1. Add high-confidence, well-labeled examples to golden dataset
1. Use dataset for CI/CD regression testing of prompt changes
1. Continuously expand dataset with new edge cases

#### Queue configuration{% #queue-configuration-2 %}

- **Labels**: Multiple categorical labels covering quality dimensions, numeric scores, pass/fail rating, notes

- **Annotators**: Team of domain experts for consistency

{% /collapsible-section %}

## Further Reading{% #further-reading %}

- [Learn about evaluation types](https://docs.datadoghq.com/llm_observability/evaluations/evaluation_types.md)
- [Route traces into queues automatically with Automation Rules](https://docs.datadoghq.com/llm_observability/monitoring/automation_rules.md)
- [Run experiments to test improvements](https://docs.datadoghq.com/llm_observability/experiments.md)
- [Annotate traces to improve LLM quality with Datadog LLM Observability](https://www.datadoghq.com/blog/automations-annotation-queues)
- [Agent Observability API reference](https://docs.datadoghq.com/api/latest/llm-observability.md)