Overview
Preview Feature
Annotation Queues are in Preview. To request access, contact ml-observability-product@datadoghq.com.
Annotation Queues provide a structured workflow for human review of LLM traces. Use annotation queues to:
- Review traces with complete context including spans, metadata, tool calls, inputs, outputs, and evaluation results
- Apply structured labels and free-form observations to traces
- Identify and categorize failure patterns
- Validate LLM-as-a-Judge evaluation accuracy
- Build golden datasets with human-verified labels for testing and validation
Create and use an annotation queue
Navigate to AI Observability > Experiment > Annotations and select your project.
Click Create Queue.
On the About tab, configure:
- Name: Descriptive name reflecting the queue’s purpose (for example, “Failed Evaluations Review - Q1 2026”)
- Project: LLM Observability project this queue belongs to
- Description (optional): Explain the queue’s purpose and any special instructions for annotators
Then click Next.
On the Schema tab, define your new queue’s label schema. Use the Preview pane to see how labels appear to annotators as you configure them.
Then click Create.
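For reference, the label types used throughout this page (categorical labels, numeric scores, pass/fail ratings, and free-form notes) can be combined in a single schema. The sketch below is purely illustrative: schemas are configured in the UI, not in code, and the label names are hypothetical.

```python
# Illustrative only: the kinds of labels an annotation queue schema might combine.
# Schemas are defined on the Schema tab in the UI; these names are hypothetical.
label_schema = {
    "failure_type": {"type": "categorical", "options": ["hallucination", "formatting", "refusal", "other"]},
    "quality_score": {"type": "score", "range": (0, 10)},
    "rating": {"type": "categorical", "options": ["pass", "fail"]},
    "notes": {"type": "free_form"},
}
```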
Add interactions to the queue:
- Navigate to AI Observability > Traces.
- Filter traces using available facets (evaluation results, error status, application, time range).
- Click an individual trace, or bulk-select multiple traces.
- Click Flag for Annotation.
- Select your queue from the drop-down.
Return to the annotation queue you created, and click Review to begin annotating.
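The traces you flag here come from applications instrumented with LLM Observability. As a point of reference, the following is a minimal sketch of emitting such a trace with the Python SDK (ddtrace); the application name and function are placeholders, and credentials come from your environment or local Agent.

```python
# Minimal sketch: emitting a trace with the LLM Observability Python SDK (ddtrace)
# so it appears in the Trace Explorer and can be flagged for annotation.
# "support-bot" and answer_question() are placeholder names.
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import workflow

LLMObs.enable(ml_app="support-bot")  # API key and site come from the environment or local Agent

@workflow
def answer_question(question: str) -> str:
    reply = "..."  # call your model, chain, or agent here
    # Record inputs and outputs so reviewers see full context in the annotation queue
    LLMObs.annotate(input_data=question, output_data=reply)
    return reply

if __name__ == "__main__":
    answer_question("How do I rotate my API key?")
```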
Managing queues
Editing queue schema
You can modify a queue’s label schema after creation:
- Navigate to AI Observability > Experiment > Annotations
- Open the queue
- If the Details panel is hidden, click View Details
- Click Edit
- Add, remove, or modify labels
- Click Save Changes
Changing the schema doesn't affect already-applied labels, but annotators will see the updated schema going forward.
Exporting annotated data
Export annotated traces for analysis or use in other workflows (see the analysis sketch after these steps):
- Navigate to AI Observability > Experiment > Annotations
- Open the queue
- Select traces (or select all)
- Click Export
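A common follow-up is loading the export into a notebook for analysis. The sketch below assumes the export was saved as a CSV with one row per annotated trace; the file name and column names (rating, notes) are assumptions and should match your queue's schema.

```python
# Sketch: inspecting an export of annotated traces with pandas.
# The file name and column names are assumptions; adjust them to your export and schema.
import pandas as pd

df = pd.read_csv("annotated_traces.csv")

print(df["rating"].value_counts(normalize=True))       # share of pass vs. fail among reviewed traces
print(df.loc[df["rating"] == "fail", "notes"].head())  # skim reviewer notes on failing traces
```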
Adding to datasets
Transfer annotated traces to datasets for experiment evaluation:
- Navigate to AI Observability > Experiment > Annotations
- Open the queue
- Select traces to transfer
- Click Add to Dataset
- Choose an existing dataset, or create a new one
Labels are included with each trace as metadata.
See Datasets for more information about using datasets in experiments.
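Conceptually, each transferred trace becomes a dataset record that carries its annotation labels as metadata. The field and label names below are illustrative only; see the Datasets documentation for the actual record structure.

```python
# Illustrative only: how annotation labels might travel with a trace added to a dataset.
# Field and label names are hypothetical; the real record structure is defined by Datasets.
record = {
    "input": {"question": "How do I rotate my API key?"},
    "expected_output": "Rotate the key from Organization Settings > API Keys.",
    "metadata": {
        "rating": "pass",        # pass/fail label applied in the annotation queue
        "failure_type": "none",  # categorical label from the queue schema
        "notes": "Accurate and concise.",
    },
}
```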
Deleting queues
To delete a queue:
- Navigate to AI Observability > Experiment > Annotations
- Open the queue
- Click Delete in the Details panel
Deleting a queue removes the queue and label associations, but does not delete the underlying traces from LLM Observability. Traces remain accessible in Trace Explorer.
Data retention
| Data | Retention period |
|---|---|
| Traces in queues | 15 days |
| Annotation labels | Indefinite |
Best practices for annotation
Be consistent:
- Review the queue description and label definitions before starting
- When multiple annotators work on the same queue, establish shared understanding of criteria
- Document reasoning in notes for borderline cases
Provide reasoning:
- Use free-form notes to document why you applied specific labels
- Note patterns you observe across multiple traces
- Reasoning helps refine evaluation criteria and understand failure modes
Example workflows
Failure pattern analysis
Review failed traces to identify recurring patterns and categorize how your application fails in production.
- Filter traces in Trace Explorer for failed evaluations or specific error patterns
- Manually select traces and add to an annotation queue
- Annotators review traces and document failure types in free-form notes
- Common patterns emerge: hallucinations in specific contexts, formatting issues, inappropriate refusals
- Create categorical labels for identified failure modes and re-code traces
- Use the failure mode distribution to prioritize fixes (see the sketch below)
Queue configuration
Labels: Free-form notes, categorical failure_type label, pass/fail rating
Annotators: Product managers, engineers, domain experts
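To turn failure-mode labels into a prioritized list of fixes, one option is to count how often each categorical label was applied. A minimal sketch, assuming the annotations have been exported and parsed into records with a failure_type field (the records below are placeholders):

```python
# Sketch: building a failure-mode distribution from annotated traces to prioritize fixes.
# The records are placeholders; in practice, build them from the queue's exported data.
from collections import Counter

annotations = [
    {"trace_id": "t1", "failure_type": "hallucination"},
    {"trace_id": "t2", "failure_type": "formatting"},
    {"trace_id": "t3", "failure_type": "hallucination"},
    {"trace_id": "t4", "failure_type": "inappropriate_refusal"},
]

distribution = Counter(a["failure_type"] for a in annotations)
for failure_mode, count in distribution.most_common():
    print(f"{failure_mode}: {count}")
```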
LLM-as-a-Judge validation
Find traces where automated evaluators may be uncertain or incorrect, then have humans provide ground truth.
- Sample traces by evaluation result: all results, or only those matching a given score or threshold
- Add selected traces to an annotation queue
- Annotators review traces and provide human scores for the same criteria
- Compare human labels to automated evaluation scores
- Identify systematic disagreements (judge too strict, too lenient, or misunderstanding criteria); see the sketch below
- Refine evaluation prompts based on disagreements
Queue configuration
Labels: Numeric scores matching evaluation criteria (0-10), categorical judge_accuracy label, reasoning notes
Annotators: Subject matter experts who understand evaluation criteria
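A simple way to surface systematic disagreements is to compare each human score with the judge's score for the same trace. The sketch below assumes both use the same 0-10 scale and that the pairs have already been joined by trace; the numbers are placeholders.

```python
# Sketch: comparing human annotation scores to LLM-as-a-Judge scores on the same 0-10 scale.
# The (human, judge) pairs are placeholders; join your exported labels and evaluation results by trace.
pairs = [(9, 9), (3, 8), (7, 6), (2, 7)]

disagreements = [(h, j) for h, j in pairs if abs(h - j) >= 3]
mean_gap = sum(j - h for h, j in pairs) / len(pairs)  # positive = judge scores higher than humans

print(f"{len(disagreements)}/{len(pairs)} traces with a gap of 3+ points")
print(f"mean judge-minus-human gap: {mean_gap:+.1f} (positive suggests the judge is too lenient)")
```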
Golden dataset curation
Build benchmark datasets with human-verified labels for regression testing and continuous validation.
- Sample diverse production traces from Trace Explorer (both good and bad examples)
- Add traces to annotation queue
- Annotators review and label traces across multiple quality dimensions
- Add high-confidence, well-labeled examples to golden dataset
- Use the dataset for CI/CD regression testing of prompt changes (see the sketch below)
- Continuously expand dataset with new edge cases
Queue configuration
Labels: Multiple categorical labels covering quality dimensions, numeric scores, pass/fail rating, notes
Annotators: Team of domain experts for consistency
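For the CI/CD step, one option is a test that replays golden examples against the current prompt or application version. A minimal sketch using pytest: the golden records are inlined for illustration, and answer_question() is a placeholder for your application entry point; in practice, load the dataset you built above.

```python
# Sketch: regression-testing prompt changes against a golden dataset of human-verified examples.
# GOLDEN is inlined for illustration; load your exported or SDK-managed dataset instead.
import pytest

GOLDEN = [
    {"input": "How do I rotate my API key?",
     "expected_output": "Organization Settings > API Keys"},
]

def answer_question(question: str) -> str:
    # Placeholder: call your application or prompt under test here
    return "You can rotate it from Organization Settings > API Keys."

@pytest.mark.parametrize("record", GOLDEN)
def test_against_golden_dataset(record):
    output = answer_question(record["input"])
    # Replace substring matching with whatever evaluation fits your quality bar
    assert record["expected_output"] in output
```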
Further Reading
Additional helpful documentation, links, and articles: