---
title: Session-Level Evaluations
description: >-
  Run a custom LLM-as-a-judge across an entire user session, with examples of
  when to use session scope over trace or span scope.
breadcrumbs: >-
  Docs > Agent Observability > Evaluations > Custom LLM-as-a-Judge Evaluations >
  Session-Level Evaluations
---


# Session-Level Evaluations

{% callout %}
# Important note for users on the following Datadog sites: app.ddog-gov.com, us2.ddog-gov.com

{% alert level="danger" %}
This product is not supported for your selected [Datadog site](https://docs.datadoghq.com/getting_started/site.md) ({% placeholder "user-datadog-site-name" /%}).
{% /alert %}

{% /callout %}

A session-level evaluation runs once per [user session](https://docs.datadoghq.com/llm_observability/instrumentation/sdk.md#tracking-user-sessions), with every trace—and every span in those traces—available to the LLM judge in a single prompt. Sessions group related interactions under a shared `session_id` (for example, a chat conversation) and can include multiple traces over an extended interaction.

Session scope answers questions about agent performance and user behavior across an entire interaction—questions that trace-level and span-level judges cannot answer from a single request or span.

{% alert level="info" %}
Session-level evaluations require spans to be tagged with a `session_id`. See [Tracking user sessions](https://docs.datadoghq.com/llm_observability/instrumentation/sdk.md#tracking-user-sessions) to instrument your application.
{% /alert %}
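
To make the grouping concrete, here is a minimal sketch of how spans that share a `session_id` roll up into traces and then into sessions. The span records and their field names (`trace_id`, `name`) are simplified, hypothetical stand-ins and not the SDK's data model. The sketch also skips untagged spans to mirror the requirement above.

```python
from collections import defaultdict

# Hypothetical, simplified span records; real spans carry many more fields.
spans = [
    {"session_id": "chat-1", "trace_id": "t1", "name": "plan"},
    {"session_id": "chat-1", "trace_id": "t1", "name": "llm_call"},
    {"session_id": "chat-1", "trace_id": "t2", "name": "llm_call"},
    {"session_id": "chat-2", "trace_id": "t3", "name": "llm_call"},
    {"session_id": None, "trace_id": "t4", "name": "llm_call"},  # untagged span
]


def group_by_session(span_records):
    """Group span names by session_id, then by trace_id."""
    sessions = defaultdict(lambda: defaultdict(list))
    for span in span_records:
        session_id = span.get("session_id")
        if session_id is None:
            # A span without a session_id cannot be attributed to any session.
            continue
        sessions[session_id][span["trace_id"]].append(span["name"])
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {sid: dict(traces) for sid, traces in sessions.items()}


sessions = group_by_session(spans)
print(sessions)
```

In this toy data, session `chat-1` spans two traces, so a session-level judge would see both, while a trace-level judge would see only one at a time.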

## Configure a session-level evaluation{% #configure-a-session-level-evaluation %}

The walkthrough below highlights the parts of the configuration that are specific to session scope. The rest of the configuration (account, model, output type, assessment criteria) is the same as for span- or trace-scoped evaluations.

1. Navigate to the Agent Observability [Evaluations page](https://app.datadoghq.com/llm/evaluations) and select Create Evaluation. Under `Evaluate On`, select Session. (You can also start from a [template evaluation](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations.md).)

1. Fill in the evaluation name, account, and model as you would for any custom LLM-as-a-judge evaluation.

   {% image
      source="https://docs.dd-static.net/images/llm_observability/evaluations/session_level_evaluation_scope.07f124e85d309a9a0a128df15ad0904c.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/llm_observability/evaluations/session_level_evaluation_scope.07f124e85d309a9a0a128df15ad0904c.png?auto=format&fit=max&w=850&dpr=2 2x"
      alt="The Evaluate On scope picker with Session selected." /%}

   {% alert level="info" %}
   A session is considered complete after 30 minutes of inactivity (no new spans for that session, measured from the most recent span), at which point the evaluation runs. Spans that arrive more than 30 minutes after the previous span are not included in the evaluation.
   {% /alert %}

1. Add a Query and Sampling Rate to control which sessions are evaluated.

1. In the System Prompt field, enter the static instructions to the LLM judge—for example, the criteria the judge should use and the output it should produce. The System Prompt does not resolve `{{ ... }}` placeholders.

1. In the User message, write the prompt that injects session data using `{{traces...}}` paths. The autocomplete dropdown adapts to session scope and lists fields available on the selected sample session. The `{{span_input}}` and `{{span_output}}` aliases are not available in session scope—reference span data through the `traces` array instead. Common patterns:

   ```
   {{traces}}                                              # JSON of every trace in the session
   {{traces[0].spans[0].meta.input.value}}                 # First span of the first trace
   {{traces[*].spans[*].name}}                             # All span names, joined with newlines
   {{traces[*].spans[meta.span.kind:llm].meta.output.value}}  # LLM outputs across the session
   {{*}}                                                   # Entire session payload as JSON
   ```

   See [Prompt Templating](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/prompt_templating.md) for the full reference.

   {% image
      source="https://docs.dd-static.net/images/llm_observability/evaluations/session_level_prompt_editor.5916588b4a8c860f2db5562bc90d4628.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/llm_observability/evaluations/session_level_prompt_editor.5916588b4a8c860f2db5562bc90d4628.png?auto=format&fit=max&w=850&dpr=2 2x"
      alt="The User prompt editor for a session-level evaluation, with the autocomplete dropdown listing traces-prefixed fields after typing two open braces." /%}

1. Pick a sample session from the panel on the right. The pane lists the traces in that session, with the fields referenced by your prompt highlighted.

   {% image
      source="https://docs.dd-static.net/images/llm_observability/evaluations/session_level_sample_session_trace_view.b66120acc653edd7a5390d2b4161d7eb.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/llm_observability/evaluations/session_level_sample_session_trace_view.b66120acc653edd7a5390d2b4161d7eb.png?auto=format&fit=max&w=850&dpr=2 2x"
      alt="The configuration page in session scope, with the sample session pane on the right showing traces and highlighted span fields." /%}

1. Click Test Evaluation to run the prompt against the selected session and preview the LLM judge's output before saving.

1. Continue with the rest of the [evaluation configuration](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations.md#define-the-evaluation-output) (output type, assessment criteria), then select Save and Publish to start running the evaluation against new sessions.
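
The `{{traces...}}` patterns from the User message step can be read as plain lookups over a nested session payload. The sketch below shows rough Python equivalents for four of them. The payload is a hypothetical, trimmed-down example whose shape is inferred only from the paths shown above, and the lookups are illustrations, not Datadog's template resolver.

```python
import json

# Hypothetical session payload; only the fields used by the example paths.
session = {
    "traces": [
        {"spans": [
            {"name": "chat_turn",
             "meta": {"span": {"kind": "workflow"},
                      "input": {"value": "Where is my order?"},
                      "output": {"value": "It ships tomorrow."}}},
            {"name": "answer",
             "meta": {"span": {"kind": "llm"},
                      "input": {"value": "Where is my order?"},
                      "output": {"value": "It ships tomorrow."}}},
        ]},
        {"spans": [
            {"name": "answer",
             "meta": {"span": {"kind": "llm"},
                      "input": {"value": "Can I change the address?"},
                      "output": {"value": "Yes, until it ships."}}},
        ]},
    ]
}


def all_spans(payload):
    """Flatten every span of every trace, preserving order."""
    return [span for trace in payload["traces"] for span in trace["spans"]]


# Rough analogue of {{traces}}: the traces serialized as JSON.
traces_json = json.dumps(session["traces"])

# Rough analogue of {{traces[0].spans[0].meta.input.value}}.
first_input = session["traces"][0]["spans"][0]["meta"]["input"]["value"]

# Rough analogue of {{traces[*].spans[*].name}}: names joined with newlines.
span_names = "\n".join(span["name"] for span in all_spans(session))

# Rough analogue of {{traces[*].spans[meta.span.kind:llm].meta.output.value}}.
llm_outputs = [
    span["meta"]["output"]["value"]
    for span in all_spans(session)
    if span["meta"]["span"]["kind"] == "llm"
]

print(first_input)
print(span_names)
print(llm_outputs)
```

Narrow paths such as the filtered LLM outputs keep the judge prompt short, while `{{traces}}` sends everything and leaves the selection to the judge.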

## Session completion{% #session-completion %}

A session-level evaluation triggers after Datadog considers a session complete. A session is complete after 30 minutes of inactivity—that is, 30 minutes have passed with no new spans arriving for that session (measured from the most recent span).

When the session completes, the evaluation runs once, with every trace in that session (and every span in those traces) available in the judge prompt. Any span that arrives more than 30 minutes after the previous span in a session is not included in the session-level evaluation.
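
The inactivity rule can be sketched as a small calculation over span arrival times. This is an illustration of the rule as described here, not Datadog's implementation. The function names are hypothetical, and treating a gap of exactly 30 minutes as still inside the window is an assumption.

```python
from datetime import datetime, timedelta

INACTIVITY_WINDOW = timedelta(minutes=30)


def split_at_first_gap(arrival_times):
    """Return (included, excluded) span times for one session.

    Spans are included until the first gap longer than the inactivity
    window; every span after that gap is excluded.
    """
    ordered = sorted(arrival_times)
    included = ordered[:1]
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > INACTIVITY_WINDOW:
            break  # the session completed before this span arrived
        included.append(current)
    excluded = ordered[len(included):]
    return included, excluded


def completion_time(included):
    """Time at which the session counts as complete, or None if no spans."""
    if not included:
        return None
    return included[-1] + INACTIVITY_WINDOW


start = datetime(2024, 1, 1, 12, 0)
arrivals = [start + timedelta(minutes=m) for m in (0, 10, 25, 70)]

included, excluded = split_at_first_gap(arrivals)
completed_at = completion_time(included)

print(len(included), len(excluded), completed_at)
```

In this example the span at minute 70 arrives 45 minutes after the previous one, so it falls outside the evaluated session, which completes at 12:55.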

## View results{% #view-results %}

After a session completes, its evaluation result is attached to the session and is available across Agent Observability in near real time. While the session is still within its 30-minute inactivity window, the result appears as Pending in the side panel; after the session completes, the pending row is replaced by the final result.

Expand the Session evaluations section on a session to see every evaluation that ran for it, alongside the LLM judge's reasoning if Enable Reasoning was turned on at configuration time. The reasoning explains *why* the judge produced that value and references the specific trace or span fields it relied on. Use it to triage individual failures and decide whether to refine the prompt or accept the verdict.

{% image
   source="https://docs.dd-static.net/images/llm_observability/evaluations/session_level_eval_results.1855163b54a835d31aa2b02441d99d30.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/llm_observability/evaluations/session_level_eval_results.1855163b54a835d31aa2b02441d99d30.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="A session detail panel with the Session evaluations section expanded. The table lists eight evaluations — including goal completeness, toxicity, topic relevancy, tool selection, sentiment, and prompt injection — each with an outcome value shown as a colored badge (such as True, Not Toxic, or On Topic) and a preview of the LLM judge's reasoning." /%}

## Example prompts{% #example-prompts %}

### Session goal completeness{% #session-goal-completeness %}

Score whether the user accomplished what they came to do across the entire session, including follow-up turns in separate traces.

**System Prompt**

```
You are evaluating an LLM chatbot session. You will see every trace in the session, including all user messages and assistant responses across turns.

Decide whether the user's goals were fully met by the end of the session. Consider:
- All distinct intents the user expressed during the session
- Whether follow-up questions indicate unresolved needs
- Whether the final state of the conversation leaves the user satisfied

Respond with one of: completed, partially_completed, failed.
```

**User**

```
Session traces:
{{traces}}
```

The managed [Goal Completeness](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations.md#goal-completeness) template evaluation implements this pattern.

### Multi-turn conversation quality{% #multi-turn-conversation-quality %}

Evaluate coherence, context retention, and tone across the full session rather than a single exchange.

**System Prompt**

```
You will see a multi-turn chat session between a user and an assistant across multiple traces.

Evaluate the session as a whole on:
- Coherence across turns
- Whether the assistant remembered relevant context from earlier turns
- Whether tone and helpfulness stayed consistent

Output one of: excellent, good, mixed, poor.
```

**User**

```
User and assistant messages across the session:
{{traces[*].spans[meta.span.kind:llm].meta.input.messages[*].content}}
{{traces[*].spans[meta.span.kind:llm].meta.output.messages[*].content}}
```

### User behavior and frustration signals{% #user-behavior-and-frustration-signals %}

Detect behavioral patterns that only emerge when viewing the full session.

**System Prompt**

```
Analyze this user session for signs of frustration, confusion, or abandonment.

Look for:
- Repeated or rephrased questions on the same topic
- Explicit expressions of dissatisfaction
- The user stopping after an incomplete or unhelpful answer

Output one of: no_issues, mild_frustration, high_frustration, abandoned.
```

**User**

```
Full session:
{{traces}}
```

### Agent consistency across a session{% #agent-consistency-across-a-session %}

Check whether the agent maintained quality and policy compliance across every turn in the session.

**System Prompt**

```
You will see all traces from one agent session. Assess whether the agent performed consistently:

- Did later turns contradict earlier correct answers?
- Did the agent recover from errors, or repeat the same mistake?
- Were safety and policy guidelines followed on every turn?

Respond with: consistent, mixed, inconsistent.
```

**User**

```
Session traces (chronological):
{{traces}}
```

## Choosing the right scope{% #choosing-the-right-scope %}

| Scope   | What the judge sees                 | Typical blind spot                               |
| ------- | ----------------------------------- | ------------------------------------------------ |
| Span    | One span's input and output         | No cross-span or cross-trace context             |
| Trace   | All spans in one trace              | No prior or later turns in the same chat session |
| Session | All traces (and spans) in a session | —                                                |

Use Session scope when the evaluation needs context from more than one trace in the same user session:

- User satisfaction — whether the session as a whole met the user's intent, not just the last reply.
- Multi-turn coherence — whether the assistant stayed on topic, maintained tone, and carried forward relevant context across turns that live in different traces.
- User behavior over time — patterns such as frustration, confusion, topic switching, or giving up before the agent finished helping.
- Agent performance across a session — consistency, regression after tool failures, or whether the agent recovered from mistakes in a later turn.

Use Trace scope when the answer depends on steps within a single request—for example, tool-call ordering, RAG faithfulness within one workflow run, or goal completion for one agent invocation. See [Trace-Level Evaluations](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/trace_level_evaluations.md).

Use Span scope when the evaluation can be answered from one span in isolation—for example, scoring a single LLM response, classifying intent on one message, or validating tool arguments on one call.
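
The guidance above reduces to two questions, sketched here as a small helper. The function and its parameter names are hypothetical and exist only to summarize the decision. They are not part of any Datadog API.

```python
def choose_scope(needs_other_traces: bool, needs_other_spans: bool) -> str:
    """Pick the narrowest scope that still gives the judge enough context."""
    if needs_other_traces:
        # Context from more than one trace in the same user session.
        return "Session"
    if needs_other_spans:
        # Context from several steps within a single request.
        return "Trace"
    # Answerable from one span in isolation.
    return "Span"


# Multi-turn coherence needs turns that live in different traces.
print(choose_scope(needs_other_traces=True, needs_other_spans=True))
# Tool-call ordering depends on steps within one request.
print(choose_scope(needs_other_traces=False, needs_other_spans=True))
# Scoring a single LLM response needs only that span.
print(choose_scope(needs_other_traces=False, needs_other_spans=False))
```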

## Permissions{% #permissions %}

Configuring evaluations requires the `Agent Observability Write` [permission](https://docs.datadoghq.com/account_management/rbac/permissions.md#llm-observability).

## Further Reading{% #further-reading %}

- [Custom LLM-as-a-Judge Evaluations](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations.md)
- [Trace-Level Evaluations](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/trace_level_evaluations.md)
- [Prompt Templating](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/prompt_templating.md)
- [Tracking user sessions](https://docs.datadoghq.com/llm_observability/instrumentation/sdk.md#tracking-user-sessions)