---
title: Annotation Queues
description: >-
  Enable systematic human review of LLM traces to identify failure modes,
  validate automated evaluations, and build golden datasets.
breadcrumbs: Docs > LLM Observability > Evaluations > Annotation Queues
---

# Annotation Queues

{% callout %}
# Important note for users on the following Datadog site: app.ddog-gov.com

{% alert level="danger" %}
This product is not supported for your selected [Datadog site](https://docs.datadoghq.com/getting_started/site).
{% /alert %}

{% /callout %}

## Overview{% #overview %}

Annotation Queues provide a structured workflow for human review of LLM traces. Use annotation queues to:

- Review traces with complete context including spans, metadata, tool calls, inputs, outputs, and evaluation results
- Apply structured labels and free-form observations to traces
- Identify and categorize failure patterns
- Validate LLM-as-a-Judge evaluation accuracy
- Build golden datasets with human-verified labels for testing and validation

## Creating an annotation queue{% #creating-an-annotation-queue %}

### Step 1: Configure queue settings{% #step-1-configure-queue-settings %}

1. Navigate to [**AI Observability > Experiment > Annotations**](https://app.datadoghq.com/llm/annotations/queues) and select your project.

1. Click **Create Queue**.

1. On the **About** tab, configure:

   - **Name**: Descriptive name reflecting the queue's purpose (for example, "Failed Evaluations Review - Q1 2026")
   - **Project**: LLM Observability project this queue belongs to
   - **Description** (optional): Explain the queue's purpose and any special instructions for annotators

1. Click **Next**.

1. On the **Schema** tab, define your new queue's label schema. Use the Preview pane to see how labels appear to annotators as you configure them. Each label can be marked as required and can optionally include:

   - **Assessment criteria**: Allow annotators to indicate pass/fail for that label value
   - **Reasoning**: Allow annotators to add a short explanation

1. Review your queue configuration and click **Create** to create the queue.

   {% image
      source="https://datadog-docs.imgix.net/images/llm_observability/evaluations/annotation_queues/schema_edit.a609351cbdde06a053e7e813868f061f.png?auto=format"
      alt="The Edit Queue modal showing the Schema tab with label configuration on the left and a preview pane on the right. The left panel displays fields for configuring a categorical label named failure_type with three categories: hallucination, formatting_error, and refusal. Checkboxes enable Assessment Criteria and Reasoning options. The right preview pane shows how the label appears to annotators with checkboxes for each category, Pass/Fail assessment buttons, and a reasoning text field." /%}

### Step 2: Select traces for annotation{% #step-2-select-traces-for-annotation %}

You can add traces to a queue manually from the Trace Explorer, or populate queues automatically using Automation Rules.

{% tab title="Manually from Trace Explorer" %}
Add traces to a queue manually from the Trace Explorer:

1. Navigate to [**AI Observability > Traces**](https://app.datadoghq.com/llm/traces).
1. Filter traces using available facets (evaluation results, error status, application, time range).
1. Select individual traces or bulk select multiple traces.
1. Click **Flag for Annotation**.
1. Choose **Create New Queue** or select an existing queue.

{% /tab %}

{% tab title="Using Automation Rules" %}
Instead of manually selecting traces, use Automation Rules to route traces into annotation queues automatically based on filters and sampling criteria. This enables continuous, hands-off queue population without requiring manual trace selection.

To add an annotation queue action to an Automation Rule:

1. Navigate to [**AI Observability > Traces**](https://app.datadoghq.com/llm/traces).
1. Apply filters to identify traces you want to route (evaluation failures, latency thresholds, specific applications). See the example queries in [Search Syntax](https://docs.datadoghq.com/logs/explorer/search_syntax/).
1. Click **Automate Query**.
1. Configure sampling rate (for example, 10% of matching traces).
1. Under **Actions**, select **Add to Annotation Queue**.
1. Choose the target queue.
1. Save the rule.

Traces matching the rule's filters are added to the queue automatically as they arrive.
{% /tab %}

## Annotating traces{% #annotating-traces %}

### Accessing your queues{% #accessing-your-queues %}

Navigate to [**AI Observability > Experiment > Annotations**](https://app.datadoghq.com/llm/annotations/queues) to see all available annotation queues. Click on a queue to see the trace list, then click **Review** to begin annotating.

Review Mode displays:

- **Full trace context** (right panel):

  - Complete span tree with inputs, outputs, metadata
  - Tool calls and intermediate reasoning steps
  - Evaluation results on trace and individual spans

- **Annotation controls** (left panel):

  - Configured labels for this queue
  - Progress indicator showing position in queue
  - Navigation controls (Previous, Next)

  {% image
     source="https://datadog-docs.imgix.net/images/llm_observability/evaluations/annotation_queues/review.5d52d7b14a54e6ba64e36cefaf677a30.png?auto=format"
     alt="The annotation review interface showing the annotation panel on the left and trace details on the right. The left panel displays label controls including failure_type checkboxes for hallucination, formatting_error, and refusal, plus a requires_escalation assessment with Pass and Fail buttons and a Save button at the bottom. The right panel shows the trace details for citizen_agent with a span tree, evaluation results, and expandable sections for Input and Output displaying JSON-formatted data about a weather information query." /%}

### Applying labels{% #applying-labels %}

For each trace:

1. **Review the full trace context**: Expand spans as needed to understand inputs, outputs, tool calls, and evaluation results.
1. **Apply labels**: Fill in the configured labels based on your assessment.
1. **Save your work**: Annotations are autosaved.

### Best practices for annotation{% #best-practices-for-annotation %}

**Be consistent**:

- Review the queue description and label definitions before starting.
- When multiple annotators work on the same queue, establish shared understanding of criteria.
- Document reasoning in notes for borderline cases.

**Provide reasoning**:

- Use free-form notes to document why you applied specific labels.
- Note patterns you observe across multiple traces.
- Reasoning helps refine evaluation criteria and understand failure modes.

## Managing queues{% #managing-queues %}

### Tracking queue progress{% #tracking-queue-progress %}

The Annotations list page displays a progress bar for each queue showing the ratio of reviewed interactions to total interactions. Use this to monitor annotation completion across queues at a glance.

### Filtering traces by annotation labels{% #filtering-traces-by-annotation-labels %}

In the Trace Explorer, use the **Annotation Labels** facet to filter traces by labels applied in annotation queues. This allows you to:

- Find all traces tagged with a specific failure mode (for example, `failure_type: hallucination`)
- Build targeted samples for downstream review, dataset creation, or CSV export for data analysis
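Exported label data can then be sliced offline. Below is a minimal sketch using Python's standard library, assuming a hypothetical CSV export with `trace_id`, `failure_type`, and `assessment` columns; the actual export schema may differ:

```python
import csv
import io

# Illustrative stand-in for a CSV export of annotated traces.
# Column names here are assumptions, not the exact export schema.
raw = """trace_id,failure_type,assessment
t1,hallucination,fail
t2,formatting_error,fail
t3,,pass
t4,hallucination,fail
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Build a targeted sample: every trace tagged with a specific failure mode.
hallucinations = [r["trace_id"] for r in rows if r["failure_type"] == "hallucination"]
print(hallucinations)  # → ['t1', 't4']
```

The same filter-then-collect pattern works for any label in your schema, such as selecting only `assessment: fail` traces for a follow-up review queue.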

### Editing queue schema{% #editing-queue-schema %}

You can modify a queue's label schema after creation:

1. Navigate to [**AI Observability > Experiment > Annotations**](https://app.datadoghq.com/llm/annotations/queues).
1. Open the queue.
1. If the Details panel is hidden, click **View Details**.
1. Click **Edit**.
1. Add, remove, or modify labels.
1. Click **Save Changes**.

{% alert level="info" %}
Changing the schema doesn't affect already-applied labels, but annotators will see the updated schema going forward.
{% /alert %}

### Exporting annotated data{% #exporting-annotated-data %}

Export annotated traces for analysis or use in other workflows:

1. Navigate to [**AI Observability > Experiment > Annotations**](https://app.datadoghq.com/llm/annotations/queues).
1. Open the queue.
1. Select traces (or select all).
1. Click **Export**.

### Adding to datasets{% #adding-to-datasets %}

Transfer annotated traces to datasets for experiment evaluation:

1. Navigate to [**AI Observability > Experiment > Annotations**](https://app.datadoghq.com/llm/annotations/queues).
1. Open the queue.
1. Select traces to transfer.
1. Click **Add to Dataset**.
1. Choose an existing dataset or create a new one.

Labels are included with each trace as metadata.

See [Datasets](https://docs.datadoghq.com/llm_observability/experiments/datasets) for more information about using datasets in experiments.

### Deleting queues{% #deleting-queues %}

To delete a queue:

1. Navigate to [**AI Observability > Experiment > Annotations**](https://app.datadoghq.com/llm/annotations/queues).
1. Open the queue.
1. Click **Delete** in the Details panel.

{% alert level="info" %}
Deleting a queue removes the queue and label associations, but does not delete the underlying traces from LLM Observability. Traces remain accessible in Trace Explorer.
{% /alert %}

## Data retention{% #data-retention %}

| Data              | Retention period                                     |
| ----------------- | ---------------------------------------------------- |
| Traces in queues  | Capped by your organization's trace retention period |
| Annotation labels | Indefinite                                           |

## Example workflows{% #example-workflows %}

{% collapsible-section
   open=null
   #example-error-analysis-and-failure-mode-discovery %}
### Error analysis and failure mode discovery

Review failed traces to identify recurring patterns and categorize how your application fails in production.

1. Filter traces in Trace Explorer for failed evaluations or specific error patterns
1. Manually select traces and add to an annotation queue
1. Annotators review traces and document failure types in free-form notes
1. Common patterns emerge: hallucinations in specific contexts, formatting issues, inappropriate refusals
1. Create categorical labels for identified failure modes and re-code traces
1. Use failure mode distribution to prioritize fixes
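Once traces are re-coded with categorical labels, the distribution in step 6 reduces to counting label values. A minimal sketch with illustrative data, reusing the example `failure_type` categories from this guide:

```python
from collections import Counter

# Illustrative annotation labels; in practice these would come from
# your exported annotated traces, not hardcoded values.
labels = [
    "hallucination", "formatting_error", "hallucination",
    "refusal", "hallucination", "formatting_error",
]

distribution = Counter(labels)

# Rank failure modes by frequency to prioritize fixes.
for mode, count in distribution.most_common():
    print(f"{mode}: {count}/{len(labels)} ({count / len(labels):.0%})")
```

The most frequent failure mode surfaces first, giving the team a data-backed order for remediation work.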

#### Queue configuration{% #queue-configuration %}

- **Labels**: Free-form notes, categorical `failure_type` label, pass/fail rating

- **Annotators**: Product managers, engineers, domain experts

{% /collapsible-section %}

{% collapsible-section
   open=null
   #example-validating-llm-as-a-judge-evaluations %}
### Validating LLM-as-a-Judge evaluations

Find traces where automated evaluators may be uncertain or incorrect, then have humans provide ground truth.

1. Sample traces by evaluation result: all results, or only those above or below a given score threshold
1. Add selected traces to an annotation queue
1. Annotators review traces and provide human scores for the same criteria
1. Compare human labels to automated evaluation scores
1. Identify systematic disagreements (judge too strict, too lenient, or misunderstanding criteria)
1. Refine evaluation prompts based on disagreements
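Steps 4 and 5 amount to an agreement calculation between the two label sources. A minimal sketch with illustrative pass/fail data (not a Datadog API), counting directional disagreements to show whether the judge skews lenient or strict:

```python
# Human ground-truth labels versus automated judge verdicts for the
# same traces, in the same order. Data is illustrative.
human = ["pass", "fail", "fail", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail"]

# Overall agreement rate between humans and the judge.
agreement = sum(h == j for h, j in zip(human, judge)) / len(human)

# Directional disagreements: judge passes what humans failed (too
# lenient) versus fails what humans passed (too strict).
too_lenient = sum(h == "fail" and j == "pass" for h, j in zip(human, judge))
too_strict = sum(h == "pass" and j == "fail" for h, j in zip(human, judge))

print(f"agreement: {agreement:.0%}, "
      f"too lenient: {too_lenient}, too strict: {too_strict}")
```

A skew in one direction (for example, far more "too lenient" cases) points to a specific fix in the judge's evaluation prompt rather than a general calibration problem.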

#### Queue configuration{% #queue-configuration-1 %}

- **Labels**: Numeric scores matching evaluation criteria (0-10), categorical `judge_accuracy` label, reasoning notes

- **Annotators**: Subject matter experts who understand evaluation criteria

{% /collapsible-section %}

{% collapsible-section open=null #example-golden-dataset-creation %}
### Golden dataset creation

Build benchmark datasets with human-verified labels for regression testing and continuous validation.

1. Sample diverse production traces from Trace Explorer (both good and bad examples)
1. Add traces to annotation queue
1. Annotators review and label traces across multiple quality dimensions
1. Add high-confidence, well-labeled examples to golden dataset
1. Use dataset for CI/CD regression testing of prompt changes
1. Continuously expand dataset with new edge cases
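The regression test in step 5 can be sketched as a simple gate over the golden examples. Both the golden records and the `run_app()` stub below are hypothetical placeholders for your application and dataset:

```python
# Illustrative golden dataset: human-verified input/expected pairs.
golden = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def run_app(prompt: str) -> str:
    # Hypothetical stand-in for the application under test; a real
    # pipeline would call your LLM application here.
    return {"2+2": "4", "capital of France": "Paris"}[prompt]

# Collect every golden example the current build gets wrong.
failures = [ex for ex in golden if run_app(ex["input"]) != ex["expected"]]

# Fail the build on any regression against the golden dataset.
assert not failures, f"golden dataset regressions: {failures}"
print(f"{len(golden)} golden examples passed")
```

Wired into CI/CD, this gate blocks prompt or model changes that regress on behavior your annotators have already verified.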

#### Queue configuration{% #queue-configuration-2 %}

- **Labels**: Multiple categorical labels covering quality dimensions, numeric scores, pass/fail rating, notes

- **Annotators**: Team of domain experts for consistency

{% /collapsible-section %}

## Further Reading{% #further-reading %}

- [Learn about evaluation types](https://docs.datadoghq.com/llm_observability/evaluations/evaluation_types)
- [Run experiments to test improvements](https://docs.datadoghq.com/llm_observability/experiments)
