Overview
Preview Feature
Annotation Queues are in Preview. To request access, contact ml-observability-product@datadoghq.com.
Annotation Queues provide a structured workflow for human review of LLM traces. Use annotation queues to:
- Review traces with complete context including spans, metadata, tool calls, inputs, outputs, and evaluation results
- Apply structured labels and free-form observations to traces
- Identify and categorize failure patterns
- Validate LLM-as-a-Judge evaluation accuracy
- Build golden datasets with human-verified labels for testing and validation
Create and use an annotation queue
Navigate to AI Observability > Experiment > Annotations and select your project.
Click Create Queue.
On the About tab, configure:
- Name: Descriptive name reflecting the queue’s purpose (for example, “Failed Evaluations Review - Q1 2026”)
- Project: LLM Observability project this queue belongs to
- Description (optional): Explain the queue’s purpose and any special instructions for annotators
Then click Next.
On the Schema tab, define your new queue’s label schema. Use the Preview pane to see how labels appear to annotators as you configure them.
Then click Create.
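For reference, the label types used throughout this page (categorical labels, numeric scores, pass/fail ratings, and free-form notes) can be combined in a single schema. The sketch below is purely illustrative: schemas are configured in the UI, not in code, and the label names are hypothetical.

```python
# Illustrative only: the kinds of labels an annotation queue schema might combine.
# Schemas are defined on the Schema tab in the UI; these names are hypothetical.
label_schema = {
    "failure_type": {"type": "categorical", "options": ["hallucination", "formatting", "refusal", "other"]},
    "quality_score": {"type": "score", "range": (0, 10)},
    "rating": {"type": "categorical", "options": ["pass", "fail"]},
    "notes": {"type": "free_form"},
}
```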
Add interactions to the queue:
- Navigate to AI Observability > Traces.
- Filter traces using available facets (evaluation results, error status, application, time range).
- Click an individual trace, or bulk-select multiple traces.
- Click Flag for Annotation.
- Select your queue from the drop-down.
Return to the annotation queue you created, and click Review to begin annotating.
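The traces you flag here come from applications instrumented with LLM Observability. As a point of reference, the following is a minimal sketch of emitting such a trace with the Python SDK (ddtrace); the application name and function are placeholders, and credentials come from your environment or local Agent.

```python
# Minimal sketch: emitting a trace with the LLM Observability Python SDK (ddtrace)
# so it appears in the Trace Explorer and can be flagged for annotation.
# "support-bot" and answer_question() are placeholder names.
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import workflow

LLMObs.enable(ml_app="support-bot")  # API key and site come from the environment or local Agent

@workflow
def answer_question(question: str) -> str:
    reply = "..."  # call your model, chain, or agent here
    # Record inputs and outputs so reviewers see full context in the annotation queue
    LLMObs.annotate(input_data=question, output_data=reply)
    return reply

if __name__ == "__main__":
    answer_question("How do I rotate my API key?")
```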
Managing queues
Editing queue schema
You can modify a queue’s label schema after creation:
- Navigate to AI Observability > Experiment > Annotations
- Open the queue
- If the Details panel is hidden, click View Details
- Click Edit
- Add, remove, or modify labels
- Click Save Changes
Changing the schema doesn't affect already-applied labels, but annotators will see the updated schema going forward.
Exporting annotated data
Export annotated traces for analysis or use in other workflows (see the analysis sketch after these steps):
- Navigate to AI Observability > Experiment > Annotations
- Open the queue
- Select traces (or select all)
- Click Export
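A common follow-up is loading the export into a notebook for analysis. The sketch below assumes the export was saved as a CSV with one row per annotated trace; the file name and column names (rating, notes) are assumptions and should match your queue's schema.

```python
# Sketch: inspecting an export of annotated traces with pandas.
# The file name and column names are assumptions; adjust them to your export and schema.
import pandas as pd

df = pd.read_csv("annotated_traces.csv")

print(df["rating"].value_counts(normalize=True))       # share of pass vs. fail among reviewed traces
print(df.loc[df["rating"] == "fail", "notes"].head())  # skim reviewer notes on failing traces
```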
Adding to datasets
Transfer annotated traces to datasets for experiment evaluation:
- Navigate to AI Observability > Experiment > Annotations
- Open the queue
- Select traces to transfer
- Click Add to Dataset
- Choose an existing dataset, or create a new one
Labels are included with each trace as metadata.
See Datasets for more information about using datasets in experiments.
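Conceptually, each transferred trace becomes a dataset record that carries its annotation labels as metadata. The field and label names below are illustrative only; see the Datasets documentation for the actual record structure.

```python
# Illustrative only: how annotation labels might travel with a trace added to a dataset.
# Field and label names are hypothetical; the real record structure is defined by Datasets.
record = {
    "input": {"question": "How do I rotate my API key?"},
    "expected_output": "Rotate the key from Organization Settings > API Keys.",
    "metadata": {
        "rating": "pass",        # pass/fail label applied in the annotation queue
        "failure_type": "none",  # categorical label from the queue schema
        "notes": "Accurate and concise.",
    },
}
```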
Deleting queues
To delete a queue:
- Navigate to AI Observability > Experiment > Annotations
- Open the queue
- Click Delete in the Details panel
Deleting a queue removes the queue and label associations, but does not delete the underlying traces from LLM Observability. Traces remain accessible in Trace Explorer.
Data retention
| Data | Retention period |
|---|---|
| Traces in queues | 15 days |
| Annotation labels | Indefinite |
Best practices for annotation
Be consistent:
- Review the queue description and label definitions before starting
- When multiple annotators work on the same queue, establish shared understanding of criteria
- Document reasoning in notes for borderline cases
Provide reasoning:
- Use free-form notes to document why you applied specific labels
- Note patterns you observe across multiple traces
- Reasoning helps refine evaluation criteria and understand failure modes
Example workflows
Failure pattern analysis
Review failed traces to identify recurring patterns and categorize how your application fails in production.
- Filter traces in Trace Explorer for failed evaluations or specific error patterns
- Manually select traces and add to an annotation queue
- Annotators review traces and document failure types in free-form notes
- Common patterns emerge: hallucinations in specific contexts, formatting issues, inappropriate refusals
- Create categorical labels for identified failure modes and re-code traces
- Use the failure mode distribution to prioritize fixes (see the sketch below)
Queue configuration
Labels: Free-form notes, categorical failure_type label, pass/fail rating
Annotators: Product managers, engineers, domain experts
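To turn failure-mode labels into a prioritized list of fixes, one option is to count how often each categorical label was applied. A minimal sketch, assuming the annotations have been exported and parsed into records with a failure_type field (the records below are placeholders):

```python
# Sketch: building a failure-mode distribution from annotated traces to prioritize fixes.
# The records are placeholders; in practice, build them from the queue's exported data.
from collections import Counter

annotations = [
    {"trace_id": "t1", "failure_type": "hallucination"},
    {"trace_id": "t2", "failure_type": "formatting"},
    {"trace_id": "t3", "failure_type": "hallucination"},
    {"trace_id": "t4", "failure_type": "inappropriate_refusal"},
]

distribution = Counter(a["failure_type"] for a in annotations)
for failure_mode, count in distribution.most_common():
    print(f"{failure_mode}: {count}")
```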
LLM-as-a-Judge validation
Find traces where automated evaluators may be uncertain or incorrect, then have humans provide ground truth.
- Sample traces by evaluation result: all results, or only those matching a given score or threshold
- Add selected traces to an annotation queue
- Annotators review traces and provide human scores for the same criteria
- Compare human labels to automated evaluation scores
- Identify systematic disagreements (judge too strict, too lenient, or misunderstanding criteria); see the sketch below
- Refine evaluation prompts based on disagreements
Queue configuration
Labels: Numeric scores matching evaluation criteria (0-10), categorical judge_accuracy label, reasoning notes
Annotators: Subject matter experts who understand evaluation criteria
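A simple way to surface systematic disagreements is to compare each human score with the judge's score for the same trace. The sketch below assumes both use the same 0-10 scale and that the pairs have already been joined by trace; the numbers are placeholders.

```python
# Sketch: comparing human annotation scores to LLM-as-a-Judge scores on the same 0-10 scale.
# The (human, judge) pairs are placeholders; join your exported labels and evaluation results by trace.
pairs = [(9, 9), (3, 8), (7, 6), (2, 7)]

disagreements = [(h, j) for h, j in pairs if abs(h - j) >= 3]
mean_gap = sum(j - h for h, j in pairs) / len(pairs)  # positive = judge scores higher than humans

print(f"{len(disagreements)}/{len(pairs)} traces with a gap of 3+ points")
print(f"mean judge-minus-human gap: {mean_gap:+.1f} (positive suggests the judge is too lenient)")
```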
Golden dataset curation
Build benchmark datasets with human-verified labels for regression testing and continuous validation.
- Sample diverse production traces from Trace Explorer (both good and bad examples)
- Add traces to annotation queue
- Annotators review and label traces across multiple quality dimensions
- Add high-confidence, well-labeled examples to golden dataset
- Use the dataset for CI/CD regression testing of prompt changes (see the sketch below)
- Continuously expand dataset with new edge cases
Queue configuration
Labels: Multiple categorical labels covering quality dimensions, numeric scores, pass/fail rating, notes
Annotators: Team of domain experts for consistency
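For the CI/CD step, one option is a test that replays golden examples against the current prompt or application version. A minimal sketch using pytest: the golden records are inlined for illustration, and answer_question() is a placeholder for your application entry point; in practice, load the dataset you built above.

```python
# Sketch: regression-testing prompt changes against a golden dataset of human-verified examples.
# GOLDEN is inlined for illustration; load your exported or SDK-managed dataset instead.
import pytest

GOLDEN = [
    {"input": "How do I rotate my API key?",
     "expected_output": "Organization Settings > API Keys"},
]

def answer_question(question: str) -> str:
    # Placeholder: call your application or prompt under test here
    return "You can rotate it from Organization Settings > API Keys."

@pytest.mark.parametrize("record", GOLDEN)
def test_against_golden_dataset(record):
    output = answer_question(record["input"])
    # Replace substring matching with whatever evaluation fits your quality bar
    assert record["expected_output"] in output
```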
Further Reading
Additional helpful documentation, links, and articles: