Overview
Annotation Queues provide a structured workflow for human review of LLM traces. Use annotation queues to:
- Review traces with complete context including spans, metadata, tool calls, inputs, outputs, and evaluation results
- Apply structured labels and free-form observations to traces
- Identify and categorize failure patterns
- Validate LLM-as-a-Judge evaluation accuracy
- Build golden datasets with human-verified labels for testing and validation
Creating an annotation queue
Navigate to AI Observability > Experiment > Annotations and select your project.
Click Create Queue.
On the About tab, configure:
- Name: Descriptive name reflecting the queue’s purpose (for example, “Failed Evaluations Review - Q1 2026”)
- Project: LLM Observability project this queue belongs to
- Description (optional): Explain the queue’s purpose and any special instructions for annotators
Then click Next.
On the Schema tab, define your new queue’s label schema. Use the Preview pane to see how labels appear to annotators as you configure them. Each label can be marked as required and can optionally include:
- Assessment criteria: Allow annotators to indicate pass/fail for that label value
- Reasoning: Allow annotators to add a short explanation
Review your queue configuration and click Create.
Selecting traces for annotation
You can add traces to a queue manually from the Trace Explorer or populate queues automatically using Automation Rules.
Add traces to a queue manually from the Trace Explorer:
- Navigate to AI Observability > Traces
- Filter traces using available facets (evaluation results, error status, application, time range)
- Select individual traces or bulk select multiple traces
- Click Flag for Annotation
- Choose Create New Queue or select an existing queue
Instead of manually selecting traces, use Automation Rules to route traces into annotation queues automatically based on filters and sampling criteria. This enables continuous, hands-off queue population without requiring manual trace selection.
To add an annotation queue action to an Automation Rule:
- Navigate to AI Observability > Traces
- Apply filters to identify traces you want to route (evaluation failures, latency thresholds, specific applications). See the example queries in Search Syntax.
- Click Automate Query
- Configure sampling rate (for example, 10% of matching traces).
- Under Actions, select Add to Annotation Queue.
- Choose the target queue.
- Save the rule.
Traces matching the rule’s filters are added to the queue automatically as they arrive.
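The sampling rate configured above keeps only a fraction of matching traces. As an illustration of how rate-based sampling behaves (this is a generic hash-based sketch, not Datadog's actual implementation), a deterministic sampler can be written in a few lines:

```python
import hashlib

def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of traces by hashing the ID.

    The same trace ID always yields the same decision, so the choice to
    sample a given trace is stable across re-evaluations.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# With rate=0.10, about 10% of distinct trace IDs are kept.
kept = sum(should_sample(f"trace-{i}", 0.10) for i in range(10_000))
```

A deterministic scheme like this keeps queue volume predictable while avoiding duplicate decisions for the same trace.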
Annotating traces
Accessing your queues
Navigate to AI Observability > Experiment > Annotations to see all available annotation queues. Click on a queue to see the trace list, then click Review to begin annotating.
Review Mode displays the full trace context alongside the labels defined in the queue's schema.
Applying labels
For each trace:
- Review the full trace context: Expand spans as needed to understand inputs, outputs, tool calls, and evaluation results.
- Apply labels: Fill in the configured labels based on your assessment.
- Annotations are autosaved as you work.
Best practices for annotation
Be consistent:
- Review the queue description and label definitions before starting.
- When multiple annotators work on the same queue, establish shared understanding of criteria.
- Document reasoning in notes for borderline cases.
Provide reasoning:
- Use free-form notes to document why you applied specific labels.
- Note patterns you observe across multiple traces.
- Reasoning helps refine evaluation criteria and understand failure modes.
Managing queues
Tracking queue progress
The Annotations list page displays a progress bar for each queue showing the ratio of reviewed interactions to total interactions. Use this to monitor annotation completion across queues at a glance.
Filtering traces by annotation labels
Use the Annotation Labels facet to filter traces by labels applied in annotation queues. This allows you to:
- Find all traces tagged with a specific failure mode (for example, failure_type: hallucination)
- Build targeted samples for downstream review, dataset creation, or CSV export for data analysis
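Once exported (see Exporting annotated data below), labeled traces can be filtered with a few lines of code. A hypothetical sketch, assuming the export contains trace_id and failure_type columns (the actual export column names may differ):

```python
import csv
import io

# Hypothetical export snippet; real column names may differ.
export = io.StringIO(
    "trace_id,failure_type\n"
    "t1,hallucination\n"
    "t2,formatting\n"
    "t3,hallucination\n"
)

rows = list(csv.DictReader(export))
# Keep only traces annotated with a specific failure mode.
hallucinations = [r["trace_id"] for r in rows if r["failure_type"] == "hallucination"]
```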
Editing queue schema
You can modify a queue’s label schema after creation:
- Navigate to AI Observability > Experiment > Annotations.
- Open the queue.
- If the Details panel is hidden, click View Details.
- Click Edit.
- Add, remove, or modify labels.
- Click Save Changes.
Changing the schema doesn't affect already-applied labels, but annotators will see the updated schema going forward.
Exporting annotated data
Export annotated traces for analysis or use in other workflows:
- Navigate to AI Observability > Experiment > Annotations.
- Open the queue.
- Select traces (or select all).
- Click Export.
Adding to datasets
Transfer annotated traces to datasets for experiment evaluation:
- Navigate to AI Observability > Experiment > Annotations.
- Open the queue.
- Select traces to transfer.
- Click Add to Dataset.
- Choose an existing dataset, or create a new one.
Labels are included with each trace as metadata.
See Datasets for more information about using datasets in experiments.
Deleting queues
To delete a queue:
- Navigate to AI Observability > Experiment > Annotations.
- Open the queue.
- Click Delete in the Details panel.
Deleting a queue removes the queue and label associations, but does not delete the underlying traces from LLM Observability. Traces remain accessible in Trace Explorer.
Data retention
| Data | Retention period |
|---|---|
| Traces in queues | Capped by your organization's trace retention period |
| Annotation labels | Indefinite |
Example workflows
Failure mode analysis
Review failed traces to identify recurring patterns and categorize how your application fails in production.
- Filter traces in Trace Explorer for failed evaluations or specific error patterns
- Manually select traces and add to an annotation queue
- Annotators review traces and document failure types in free-form notes
- Common patterns emerge: hallucinations in specific contexts, formatting issues, inappropriate refusals
- Create categorical labels for identified failure modes and re-code traces
- Use failure mode distribution to prioritize fixes
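The failure mode distribution in the last step can be computed directly from the collected labels. A small sketch (the label values here are illustrative):

```python
from collections import Counter

# Illustrative failure_type labels collected from an annotation queue.
labels = ["hallucination", "formatting", "hallucination",
          "refusal", "hallucination", "formatting"]

# most_common() puts the most frequent failure modes first,
# so fixes can be prioritized by impact.
distribution = Counter(labels).most_common()
```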
Queue configuration
Labels: Free-form notes, categorical failure_type label, pass/fail rating
Annotators: Product managers, engineers, domain experts
LLM-as-a-Judge validation
Find traces where automated evaluators may be uncertain or incorrect, then have humans provide ground truth.
- Sample evaluation results: all results, or a given score/threshold
- Add selected traces to an annotation queue
- Annotators review traces and provide human scores for the same criteria
- Compare human labels to automated evaluation scores
- Identify systematic disagreements (judge too strict, too lenient, or misunderstanding criteria)
- Refine evaluation prompts based on disagreements
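Comparing human labels to the automated judge can be as simple as measuring exact agreement and directional bias. A sketch under the assumption that both use the same 0-10 scale (the scores below are illustrative):

```python
# Paired scores for the same traces: (human, judge). Values are illustrative.
pairs = [(8, 5), (9, 6), (3, 3), (7, 4), (10, 10)]

diffs = [judge - human for human, judge in pairs]
# Negative mean bias => the judge scores lower than humans (too strict);
# positive => too lenient.
mean_bias = sum(diffs) / len(diffs)
exact_agreement = sum(d == 0 for d in diffs) / len(diffs)
```

A strongly negative mean bias with low agreement is a signal to revisit the judge's evaluation prompt.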
Queue configuration
Labels: Numeric scores matching evaluation criteria (0-10), categorical judge_accuracy label, reasoning notes
Annotators: Subject matter experts who understand evaluation criteria
Golden dataset creation
Build benchmark datasets with human-verified labels for regression testing and continuous validation.
- Sample diverse production traces from Trace Explorer (both good and bad examples)
- Add traces to annotation queue
- Annotators review and label traces across multiple quality dimensions
- Add high-confidence, well-labeled examples to golden dataset
- Use dataset for CI/CD regression testing of prompt changes
- Continuously expand dataset with new edge cases
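The regression-testing step can be a plain assertion over the golden dataset in CI. A hypothetical sketch: run_app stands in for your application under test, and the pass threshold is an assumption to tune for your use case:

```python
# Hypothetical golden dataset: human-verified (input, expected_output) pairs.
golden = [
    ("What is 2+2?", "4"),
    ("Capital of France?", "Paris"),
]

def run_app(prompt: str) -> str:
    # Placeholder for the real application; replace with your LLM call.
    return {"What is 2+2?": "4", "Capital of France?": "Paris"}[prompt]

passed = sum(run_app(q) == expected for q, expected in golden)
pass_rate = passed / len(golden)
# Fail the CI run if prompt changes regress against human-verified labels.
assert pass_rate >= 0.95, f"Regression: pass rate {pass_rate:.0%} below threshold"
```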
Queue configuration
Labels: Multiple categorical labels covering quality dimensions, numeric scores, pass/fail rating, notes
Annotators: Team of domain experts for consistency