Session-Level Evaluations
This product is not supported for your selected
Datadog site. (
).
A session-level evaluation runs once per user session, with every trace—and every span in those traces—available to the LLM judge in a single prompt. Sessions group related interactions under a shared session_id (for example, a chat conversation) and can include multiple traces over an extended interaction.
Session scope answers questions about agent performance and user behavior across an entire interaction—questions that trace-level and span-level judges cannot answer from a single request or span.
Session-level evaluations require spans to be tagged with a
session_id. See
Tracking user sessions to instrument your application.
The walkthrough below highlights the parts of the configuration that are specific to session scope. The rest of the configuration (account, model, output type, assessment criteria) is the same as for span- or trace-scoped evaluations.
Navigate to the Agent Observability Evaluations page and select Create Evaluation, then in the Evaluate On select Session. (You can also start from a template evaluation.)
Fill in the evaluation name, account, and model as you would for any custom LLM-as-a-judge evaluation.
A session is considered complete after 30 minutes of inactivity (no new spans for that session, measured from the most recent span), at which point the evaluation runs. Spans that arrive more than 30 minutes after the previous span are not included in the evaluation.
Add a Query and Sampling Rate to control which sessions are evaluated.
In the System Prompt field, enter the static instructions to the LLM judge—for example, the criteria the judge should use and the output it should produce. The System Prompt does not resolve {{ ... }} placeholders.
In the User message, write the prompt that injects session data using {{traces...}} paths. The autocomplete dropdown adapts to session scope and lists fields available on the selected sample session. The {{span_input}} and {{span_output}} aliases are not available in session scope—reference span data through the traces array instead. Common patterns:
{{traces}} # JSON of every trace in the session
{{traces[0].spans[0].meta.input.value}} # First span of the first trace
{{traces[*].spans[*].name}} # All span names, joined with newlines
{{traces[*].spans[meta.span.kind:llm].meta.output.value}} # LLM outputs across the session
{{*}} # Entire session payload as JSON
See Prompt Templating for the full reference.
Pick a sample session from the panel on the right. The pane lists the traces in that session, with the fields referenced by your prompt highlighted.
Click Test Evaluation to run the prompt against the selected session and preview the LLM judge’s output before saving.
Continue with the rest of the evaluation configuration (output type, assessment criteria) and Save and Publish to start running the evaluation against new sessions.
Session completion
A session-level evaluation triggers after Datadog considers a session complete. A session is complete after 30 minutes of inactivity—that is, 30 minutes have passed with no new spans arriving for that session (measured from the most recent span).
When the session completes, the evaluation runs once with every trace and every span in those traces from that session available in the judge prompt. Any spans that arrive more than 30 minutes after the previous span on a session are not included in the session-level evaluation.
View results
After a session completes, its evaluation result is attached to the session and is available across Agent Observability in near-real-time. While the session is still within its 30-minute inactivity window, the result shows up as Pending in the side panel; after the session completes, the pending row is replaced by the final result.
Unfold the Session evaluations on a session to see every evaluation that ran for it, alongside the LLM judge’s reasoning when Enable Reasoning was turned on at configuration time. The reasoning explains why the judge produced that value and references specific trace or span fields it relied on—use it to triage individual failures and decide whether to refine the prompt or accept the verdict.
Example prompts
Session goal completeness
Score whether the user accomplished what they came to do across the entire session, including follow-up turns in separate traces.
System Prompt
You are evaluating an LLM chatbot session. You will see every trace in the session, including all user messages and assistant responses across turns.
Decide whether the user's goals were fully met by the end of the session. Consider:
- All distinct intents the user expressed during the session
- Whether follow-up questions indicate unresolved needs
- Whether the final state of the conversation leaves the user satisfied
Respond with one of: completed, partially_completed, failed.
User
Session traces:
{{traces}}
The managed Goal Completeness template evaluation implements this pattern.
Multi-turn conversation quality
Evaluate coherence, context retention, and tone across the full session rather than a single exchange.
System Prompt
You will see a multi-turn chat session between a user and an assistant across multiple traces.
Evaluate the session as a whole on:
- Coherence across turns
- Whether the assistant remembered relevant context from earlier turns
- Whether tone and helpfulness stayed consistent
Output one of: excellent, good, mixed, poor.
User
User and assistant messages across the session:
{{traces[*].spans[meta.span.kind:llm].meta.input.messages[*].content}}
{{traces[*].spans[meta.span.kind:llm].meta.output.messages[*].content}}
User behavior and frustration signals
Detect behavioral patterns that only emerge when viewing the full session.
System Prompt
Analyze this user session for signs of frustration, confusion, or abandonment.
Look for:
- Repeated or rephrased questions on the same topic
- Explicit expressions of dissatisfaction
- The user stopping after an incomplete or unhelpful answer
Output one of: no_issues, mild_frustration, high_frustration, abandoned.
User
Agent consistency across a session
Check whether the agent maintained quality and policy compliance across every turn in the session.
System Prompt
You will see all traces from one agent session. Assess whether the agent performed consistently:
- Did later turns contradict earlier correct answers?
- Did the agent recover from errors, or repeat the same mistake?
- Were safety and policy guidelines followed on every turn?
Respond with: consistent, mixed, inconsistent.
User
Session traces (chronological):
{{traces}}
Choosing the right scope
| Scope | What the judge sees | Typical blind spot |
|---|
| Span | One span’s input and output | No cross-span or cross-trace context |
| Trace | All spans in one trace | No prior or later turns in the same chat session |
| Session | All traces (and spans) in a session | — |
Use Session scope when the evaluation needs context from more than one trace in the same user session:
- User satisfaction — whether the session as a whole met the user’s intent, not just the last reply.
- Multi-turn coherence — whether the assistant stayed on topic, maintained tone, and carried forward relevant context across turns that live in different traces.
- User behavior over time — patterns such as frustration, confusion, topic switching, or giving up before the agent finished helping.
- Agent performance across a session — consistency, regression after tool failures, or whether the agent recovered from mistakes in a later turn.
Use Trace scope when the answer depends on steps within a single request—for example, tool-call ordering, RAG faithfulness within one workflow run, or goal completion for one agent invocation. See Trace-Level Evaluations.
Use Span scope when the evaluation can be answered from one span in isolation—for example, scoring a single LLM response, classifying intent on one message, or validating tool arguments on one call.
Permissions
Configuring evaluations requires the Agent Observability Write permission.
Further Reading
Additional helpful documentation, links, and articles: