A trace-level evaluation runs once per trace, with every span in the trace available to the LLM judge in a single prompt. This is the right choice when the answer to a question depends on the interaction between spans—for example, whether an agent reached its goal, whether tools were called in the correct order, or whether a multi-turn conversation stayed on topic.
Span-level evaluations, by contrast, run once per matching span and only see that span’s input and output.
When to use trace scope over span scope
Use Trace scope when the evaluation needs context from more than one span:
- The answer depends on a sequence of steps (a tool call followed by an LLM response that uses the tool’s output).
- The judge has to look at a final answer in the context of intermediate reasoning.
- A “pass” depends on something that happened earlier in the trace (a retrieval, a guardrail, a refusal).
Use Span scope when the evaluation can be answered from a single span in isolation—for example, scoring the quality of one LLM response, classifying user intent on the root span, or checking that a tool received well-formed arguments.
Use cases and examples
1. Goal completion for a multi-step agent
Score whether an agent finished what the user asked for, given the full sequence of tool calls and reasoning steps.
System Prompt
You are evaluating an AI agent that helps users complete tasks. You will see the full trace of an agent run, including every LLM call, tool invocation, and intermediate response.
Your job is to decide whether the agent achieved the user's goal. Consider:
- Did the agent understand the user's request?
- Did the tool calls and reasoning steps actually progress toward the goal?
- Is the final response a complete answer, or does it leave the request unfinished?
Respond with one of: completed, partially_completed, failed.
User
// {{spans[0].meta.input.value}} → input of the trace's first span (the user goal).
User goal: {{spans[0].meta.input.value}}
// {{spans}} → JSON of every span in the trace, in order.
Agent steps:
{{spans}}
2. Tool selection and arguments
Check whether an agent called the right tool with the right arguments for the user’s question.
System Prompt
You will see a trace where an LLM agent decides which tool to call and with what arguments. Decide whether the chosen tool and its arguments were appropriate for the user's request.
Score from 1 (clearly wrong tool or arguments) to 5 (perfect tool choice and arguments).
User
// First span's input—the user's original question.
User question: {{spans[0].meta.input.value}}
// `[meta.span.kind:tool]` filter → keeps only spans whose kind is "tool".
// `.meta.input.parameters` → the arguments passed to each matching tool call.
Tool calls made during this trace:
{{spans[meta.span.kind:tool].meta.input.parameters}}
// `[*]` wildcard → fan out across every span; outputs are joined with newlines.
Final response:
{{spans[*].meta.output.value}}
3. Faithfulness in a RAG workflow
Check whether the final answer is grounded in the documents that were retrieved earlier in the trace.
System Prompt
You will see a trace from a retrieval-augmented generation pipeline. The retrieval step provides context documents, and a downstream LLM step produces an answer.
Decide whether the final answer is supported by the retrieved documents. Mark `true` only if every factual claim in the answer can be traced back to one of the documents.
User
// `[meta.span.kind:retrieval]` filter → only retrieval spans (the documents fetched).
// `.meta.output.documents[*].text` → the text of every document the retrieval returned,
// joined with newlines.
Retrieved context:
{{spans[meta.span.kind:retrieval].meta.output.documents[*].text}}
// `[meta.span.kind:llm]` filter → outputs of the LLM call(s) that produced the answer.
Final answer:
{{spans[meta.span.kind:llm].meta.output.value}}
4. Conversation quality across turns
Score the overall quality of a multi-turn conversation, factoring in coherence across turns rather than the quality of any single response.
System Prompt
You will see a multi-turn conversation between a user and an assistant. Evaluate the conversation as a whole on:
- Coherence across turns
- Whether the assistant remembered relevant context from earlier turns
- Whether the assistant's tone stayed consistent
Output one of: excellent, good, mixed, poor.
User
// `meta.input.messages[*].content` → fans out over each LLM span's input messages
// and joins their content with newlines.
Conversation:
{{spans[meta.span.kind:llm].meta.input.messages[*].content}}
// Same pattern, but on the LLM output messages—the assistant's replies.
Assistant responses:
{{spans[meta.span.kind:llm].meta.output.messages[*].content}}
How trace completion works
A trace-level evaluation triggers after Datadog considers a trace complete. A trace is complete after 3 minutes of inactivity, that is, once 3 minutes have passed with no new spans arriving for that trace.
Spans that arrive more than 3 minutes after the previous span in a trace are not included in the trace-level evaluation. If your application runs long-lived agents whose steps can be more than 3 minutes apart, plan for those late spans to be excluded.
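The inactivity rule can be illustrated with a small sketch. This is not Datadog's implementation: the `SpanArrival` type, its `arrived_at` field, and the assumption that every span after the first long gap is excluded are all hypothetical, chosen only to make the 3-minute rule concrete.

```python
from dataclasses import dataclass

INACTIVITY_WINDOW_S = 3 * 60  # the 3-minute window described above


@dataclass
class SpanArrival:
    name: str
    arrived_at: float  # seconds since an arbitrary reference (hypothetical field)


def split_by_inactivity(arrivals, window_s=INACTIVITY_WINDOW_S):
    """Return (included, excluded) span names for one trace.

    Spans are included until the first gap between consecutive arrivals
    exceeds the window. Assumption: once that gap occurs the trace is
    treated as complete, so every later span counts as late.
    """
    ordered = sorted(arrivals, key=lambda s: s.arrived_at)
    included, excluded = [], []
    closed = False
    previous = None
    for span in ordered:
        if previous is not None and span.arrived_at - previous > window_s:
            closed = True
        (excluded if closed else included).append(span.name)
        previous = span.arrived_at
    return included, excluded


arrivals = [
    SpanArrival("agent.workflow", 0),
    SpanArrival("llm.plan", 20),
    SpanArrival("tool.search", 150),
    SpanArrival("llm.answer", 400),  # 250 s after the previous span
]
included, excluded = split_by_inactivity(arrivals)
```

In this example the last span arrives 250 seconds after the previous one, so it falls outside the window and would not be visible to the judge.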
The walkthrough below highlights the parts of the configuration that are specific to trace scope. The rest of the configuration (account, model, output type, assessment criteria) is the same as for span-scoped evaluations.
Navigate to the LLM Observability Evaluations page and select Create Evaluation, then Create your own. (You can also start from a template evaluation.)
Fill in the evaluation name, account, and model as you would for any custom LLM-as-a-judge evaluation.
Under Evaluation Scope > Evaluate On, select Trace.
A trace is considered complete after 3 minutes of inactivity (no new spans for that trace), at which point the evaluation runs. Spans that arrive more than 3 minutes after the previous span are not included in the evaluation.
Add a Query and Sampling Rate to control which traces are evaluated. The query is matched against the trace’s root span—for example, @name:agent.workflow evaluates only traces whose root span is named agent.workflow.
In the System Prompt field, enter the static instructions to the LLM judge—for example, the criteria the judge should use and the output it should produce. The System Prompt does not resolve {{ ... }} placeholders.
In the User message, write the prompt that injects trace data using {{spans...}} paths. The autocomplete dropdown adapts to trace scope and lists every field available on the selected sample trace. The {{span_input}} and {{span_output}} aliases are not available in trace scope—reference span data through the spans array instead. Common patterns:
{{spans}} # JSON of every span in the trace
{{spans[0].meta.input.value}} # First span only
{{spans[*].name}} # All span names, joined with newlines
{{spans[name:my-span].meta.input.value}} # Filter spans by attribute
{{spans[meta.span.kind:llm].meta.output.value}} # All LLM-kind spans' outputs
{{*}} # Entire trace payload as JSON
See Prompt Templating for the full reference.
Pick a sample trace from the panel on the right. The pane title becomes Spans in Selected Trace and renders the spans of that trace, with the fields referenced by your prompt highlighted.
Click Test Evaluation to run the prompt against the selected trace and preview the LLM judge’s output before saving.
Continue with the rest of the evaluation configuration (output type, assessment criteria) and Save and Publish to start running the evaluation against new traces.
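To build intuition for how the {{spans...}} path patterns from the walkthrough select data, here is a minimal Python sketch of a resolver. It is an illustration only, not Datadog's templating engine: the functions `resolve` and `render_template`, the `SAMPLE_TRACE` payload, and the exact span layout (a nested `meta.span.kind` field) are assumptions inferred from the paths shown in this page. See Prompt Templating for the actual behavior.

```python
import json
import re

# One path segment: a name, optionally followed by a [selector].
SEGMENT = re.compile(r"\.?([^.\[\]]+)(?:\[([^\]]*)\])?")


def _get(obj, dotted):
    """Follow a dotted key path through nested dicts; None if missing."""
    for key in dotted.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def _select(items, selector):
    """Apply an index, the * wildcard, or a key:value filter to a list."""
    if selector == "*":
        return list(items)
    if selector.isdigit():
        index = int(selector)
        return [items[index]] if index < len(items) else []
    key, _, wanted = selector.partition(":")
    return [item for item in items if str(_get(item, key)) == wanted]


def resolve(path, trace):
    """Resolve one placeholder path against a trace payload."""
    path = path.strip()
    if path == "*":
        return json.dumps(trace)
    nodes = [trace]
    for name, selector in SEGMENT.findall(path):
        matched = []
        for node in nodes:
            value = node.get(name) if isinstance(node, dict) else None
            if value is None:
                continue
            if not selector:
                matched.append(value)
            elif isinstance(value, list):
                matched.extend(_select(value, selector))
        nodes = matched
    # Strings are emitted as-is; anything else as JSON. Multiple matches
    # are joined with newlines.
    return "\n".join(n if isinstance(n, str) else json.dumps(n) for n in nodes)


def render_template(template, trace):
    """Replace every {{ path }} placeholder in a User message."""
    return re.sub(r"\{\{(.*?)\}\}", lambda m: resolve(m.group(1), trace), template)


# Hypothetical trace payload, shaped after the paths used in this page.
SAMPLE_TRACE = {
    "spans": [
        {"name": "agent.workflow",
         "meta": {"span": {"kind": "agent"},
                  "input": {"value": "Find flights to Paris"},
                  "output": {"value": "Booked AF123"}}},
        {"name": "search_flights",
         "meta": {"span": {"kind": "tool"},
                  "input": {"parameters": {"destination": "CDG"}},
                  "output": {"value": "AF123"}}},
    ]
}

prompt = render_template("User goal: {{spans[0].meta.input.value}}", SAMPLE_TRACE)
```

The sketch treats a selector as an index, a wildcard, or a key:value filter, and joins multiple matches with newlines, mirroring the comments in the pattern list. Edge cases such as missing fields or non-string values may be handled differently by the real templating engine.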
Viewing results
After a trace completes, its evaluation result is attached to the trace itself and is available across LLM Observability in near real time. While the trace is still within its 3-minute inactivity window, the result shows up as Pending in the side panel; after the trace completes, the pending row is replaced by the final result.
Query results
Trace-level evaluation results use the same query syntax as span-level evaluations. Use these patterns in the LLM Observability Traces explorer, in dashboards, and in monitor queries:
| Query | Purpose |
|---|---|
| @evaluation.<evaluation_name>.value:complete | Filter to traces with a specific evaluation value |
| @evaluation.<evaluation_name>.assessment:fail | Filter to traces that failed your evaluation’s pass criteria |
| @evaluation.<evaluation_name>.value:* | All traces that have a result for this evaluator (excludes pending) |
Substitute <evaluation_name> with the name you set when creating the evaluator. Evaluation values can also be used as facets for grouping in dashboards and monitors.
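If you generate these queries programmatically, for example when templating dashboards, a small helper can keep the syntax consistent. The sketch below only builds query strings in the format shown in the table; the function name `evaluation_query` and the evaluation name `goal_completion` are hypothetical.

```python
def evaluation_query(evaluation_name, field, value="*"):
    """Build a query string in the @evaluation.<name>.<field>:<value> format."""
    if field not in ("value", "assessment"):
        raise ValueError("field must be 'value' or 'assessment'")
    return f"@evaluation.{evaluation_name}.{field}:{value}"


failed = evaluation_query("goal_completion", "assessment", "fail")
any_result = evaluation_query("goal_completion", "value")
```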
Debug results
Open the Evaluations tab on a trace to see every evaluation that ran for it, alongside the LLM judge’s reasoning when Enable Reasoning was turned on at configuration time. The reasoning explains why the judge produced that value and references specific span fields it relied on—use it to triage individual failures and decide whether to refine the prompt or accept the verdict.
Monitor results
Wire trace-level evaluation results into monitors and annotation queues to alert on regressions and route failures for human review:
Monitor on pass-rate drop. Create a monitor with a query like the following to alert when the rolling pass rate drops:
formula(100 * a / b) < 80
where a = count(@evaluation.<evaluation_name>.assessment:pass) by {ml_app}
b = count(@evaluation.<evaluation_name>.value:*) by {ml_app}
over last_15m
Route failures to an annotation queue. Configure an Automation Rule that matches @evaluation.<evaluation_name>.assessment:fail and adds the trace to an annotation queue for a human reviewer. This closes the loop on judge mistakes—failed traces are reviewed, mislabels are corrected, and the corrections become the dataset you use to refine the evaluator prompt.
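The arithmetic behind the pass-rate monitor formula above can be checked by hand. The following Python sketch reproduces the 100 * a / b < 80 comparison for a single ml_app group; the function names and the handling of an empty window (no evaluated traces) are assumptions, not documented monitor behavior.

```python
def pass_rate_percent(passed, evaluated):
    """Return 100 * a / b, or None when no traces were evaluated."""
    if evaluated == 0:
        return None  # assumption: treat an empty window as "no data"
    return 100 * passed / evaluated


def should_alert(passed, evaluated, threshold=80.0):
    """True when the pass rate for the window is below the threshold."""
    rate = pass_rate_percent(passed, evaluated)
    return rate is not None and rate < threshold


# 36 passing traces out of 50 evaluated in the last 15 minutes.
low_window = should_alert(passed=36, evaluated=50)
# 45 passing traces out of 50 evaluated.
healthy_window = should_alert(passed=45, evaluated=50)
```

In the first window the pass rate is 72%, which is below the threshold of 80 and would trigger the alert; in the second it is 90%, which would not.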
Permissions
Configuring evaluations requires the LLM Observability Write permission.