---
title: Custom LLM-as-a-Judge Evaluations
description: >-
  How to create custom LLM-as-a-judge evaluations, and how to use these
  evaluation results across Agent Observability.
breadcrumbs: Docs > Agent Observability > Evaluations > Custom LLM-as-a-Judge Evaluations
---

# Custom LLM-as-a-Judge Evaluations

{% callout %}
# Important note for users on the following Datadog sites: app.ddog-gov.com, us2.ddog-gov.com

{% alert level="danger" %}
This product is not supported for your selected [Datadog site](https://docs.datadoghq.com/getting_started/site.md). ({% placeholder "user-datadog-site-name" /%}).
{% /alert %}

{% /callout %}

Custom LLM-as-a-judge evaluations use an LLM to judge the performance of another LLM. Define evaluation logic with natural language prompts, capture subjective or objective criteria (like tone, helpfulness, or factuality), and run the evaluations at scale on:

- **Span scope**—score the input and output of one LLM call, agent step, or tool invocation in isolation.
- **Trace scope**—feed every span of a trace to the LLM judge in a single prompt, so the evaluation can reason across steps. See [Trace-Level Evaluations](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/trace_level_evaluations.md) for the full walkthrough, use cases, and prompt examples.
- **Session scope**—feed every trace in a user session (and every span in those traces) to the LLM judge in a single prompt, so the evaluation can reason across an entire multi-turn interaction. See [Session-Level Evaluations](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/session_level_evaluations.md) for the full walkthrough, use cases, and prompt examples.

## Create a custom LLM-as-a-judge evaluation{% #create-a-custom-llm-as-a-judge-evaluation %}

You can create and manage custom evaluations from the [Evaluations page](https://app.datadoghq.com/llm/evaluations) in Agent Observability. You can provide an evaluation description to generate an evaluation, use and build on existing [template LLM-as-a-judge evaluations](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations.md) we provide, or start from scratch. You can enable tracing to see traces from your evaluations.

{% alert level="info" %}
If you already have an `LLMJudge` defined in the SDK, you can publish it directly to Datadog without rebuilding the configuration in the UI. See [Publishing an LLMJudge as a Datadog managed evaluation](https://docs.datadoghq.com/llm_observability/guide/evaluation_developer_guide.md#publishing-an-llmjudge-as-a-datadog-managed-evaluation).
{% /alert %}

Learn more about the [compatibility requirements](https://docs.datadoghq.com/llm_observability/evaluations/evaluation_compatibility.md).

### Configure the prompt{% #configure-the-prompt %}

1. In Datadog, navigate to the Agent Observability [Evaluations page](https://app.datadoghq.com/llm/evaluations). Select Create Evaluation, then select Create your own.
   {% image
      source="https://docs.dd-static.net/images/llm_observability/evaluations/EvalConfig_LLMO_1.4b1669f8ef7a26c8924798c036c7d01f.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/llm_observability/evaluations/EvalConfig_LLMO_1.4b1669f8ef7a26c8924798c036c7d01f.png?auto=format&fit=max&w=850&dpr=2 2x"
      alt="The Agent Observability Evaluations page after selecting Create Evaluation." /%}
1. To enable tracing for evaluations, click the Tracing Disabled button, then select the Trace Evaluations toggle to enable tracing. When this evaluation runs, its traces appear under `datadog-evaluations`, giving you greater visibility into your evaluations. **Note**: Enabling tracing increases the number of billed spans sent to Datadog.
   {% image
      source="https://docs.dd-static.net/images/llm_observability/evaluations/evaluation_tracing_enabled.6d03d342acb93e2a0f9ee18520da3a01.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/llm_observability/evaluations/evaluation_tracing_enabled.6d03d342acb93e2a0f9ee18520da3a01.png?auto=format&fit=max&w=850&dpr=2 2x"
      alt="Trace Evaluations enabled after the toggle to enable evaluation tracing has been selected." /%}
1. Provide a clear, descriptive evaluation name (for example, `factuality-check` or `tone-eval`). You can use this name when querying evaluation results. The name must be unique within your application.
1. Configure the model:
   1. Select the Account dropdown menu to select the LLM provider and corresponding account to use for your LLM judge. To connect a new account, see [connect an LLM provider](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/connect_to_account.md).
      - If you select an Amazon Bedrock account, choose a region the account is configured for. You can then select a model name or provide the inference profile ARN.
      - If you select a Vertex account, choose a project and location.
   1. Use the Model dropdown menu to select a model.
1. In Runs On, select the application you want to evaluate, what you want to evaluate on (span, trace, or session), and the sampling rate. You can add more filtering criteria by selecting the button to the right of the sampling rate.
1. In the Template section, use the dropdown menu:
   - Create from scratch: Use your own custom prompt (defined in the next step).
   - Failure to Answer, Prompt Injection, Sentiment, etc.: Populate a pre-existing prompt template. You can use these templates as-is, or modify them to match your specific evaluation logic.
1. In the System Prompt field, enter your custom prompt or modify a prompt template. For custom prompts, provide clear instructions describing what the evaluator should assess.
   - Focus on a single evaluation goal.
   - Include 2-3 few-shot examples showing input/output pairs, expected results, and reasoning.

{% collapsible-section #custom-prompt-example %}
#### Example custom prompt

**System Prompt**

```
You will be looking at interactions between a user and a budgeting AI agent. Your job is to classify the user's intent when it comes to using the budgeting AI agent.

You will be given a Span Input, which represents the user's message to the agent, which you will then classify. Here are some examples.

Span Input: What are the core things I should know about budgeting?
Classification: general_financial_advice

Span Input: Did I go over budget with my grocery bills last month?
Classification: budgeting_question

Span Input: What is the category for which I have the highest budget?
Classification: budgeting_question

Span Input: Based on my past months, what is my ideal budget for subscriptions?
Classification: budgeting_advice

Span Input: Raise my restaurant budget by $50
Classification: budgeting_request

Span Input: Help me plan a trip to the Maldives
Classification: unrelated
```

**User**

```
Span Input: {{span_input}}
```

{% /collapsible-section %}

In the User Prompt field, specify what parts of the span, trace, or session to evaluate by adding variables. You can add any span attribute, such as Span Input (`{{span_input}}`), Output (`{{span_output}}`), or any other span field. For trace-scoped evaluations, use `{{spans...}}` paths to read across spans; for session-scoped evaluations, use `{{traces...}}` paths to read across traces. See [Prompt Templating](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/prompt_templating.md) for the full reference. To edit the user prompt directly, select it and edit the text.

You may also use the panel on the right (Filtered Spans in span scope, Filtered Traces in trace scope, Filtered Sessions in session scope) to add span data as a variable:

1. Choose an account and an application so that spans, traces, or sessions show up on the right.
1. Select one of the spans on the right to view its JSON.
1. Select + to add the JSON to your user prompt.

{% image
   source="https://docs.dd-static.net/images/llm_observability/evaluations/custom_llm_judge_2-5.266ab1d0548a47fb5f5145b7cd9d88ff.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/llm_observability/evaluations/custom_llm_judge_2-5.266ab1d0548a47fb5f5145b7cd9d88ff.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="The menu contents of the JSON view in the custom evaluation configuration right pane, displaying the option to Add variable to message." /%}
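To make the role of variables such as `{{span_input}}` concrete, the following minimal sketch shows the general idea of template substitution. It is an illustration only: Datadog renders the real prompt on its side, and the helper name `render_user_prompt` and the flat `span` dictionary are assumptions made for this example.

```python
import re

# Hypothetical helper: illustrates `{{variable}}` substitution only.
# Datadog renders the real prompt itself; this is not its implementation.
_VARIABLE = re.compile(r"\{\{\s*([A-Za-z0-9_.]+)\s*\}\}")


def render_user_prompt(template: str, span: dict) -> str:
    """Replace each {{name}} placeholder with the matching value from `span`."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        # Leave unknown variables untouched so missing data is easy to spot.
        return str(span[name]) if name in span else match.group(0)

    return _VARIABLE.sub(substitute, template)


span = {"span_input": "Raise my restaurant budget by $50"}
print(render_user_prompt("Span Input: {{span_input}}", span))
```

Leaving unmatched placeholders in place, rather than substituting an empty string, makes it easier to notice when a prompt references a field that the selected span does not contain.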

### Define the evaluation output{% #define-the-evaluation-output %}

For OpenAI, Azure OpenAI, Vertex AI, Anthropic, or Amazon Bedrock models, configure Structured Output.

For Anthropic or Amazon Bedrock models, you can alternatively configure Keyword Search Output.

For AI Gateway, both Structured Output and Keyword Search Output are supported. Datadog recommends using Structured Output when your model supports it, and falling back to Keyword Search Output otherwise.

{% collapsible-section open=null #structured-output %}
#### Structured Output (OpenAI, Azure OpenAI, Anthropic, Amazon Bedrock, AI Gateway, Vertex AI)

1. Select an evaluation output type:

   - Boolean: True/false results (for example, "Did the model follow instructions?")
   - Score: Numeric ratings (for example, a 1-5 scale for helpfulness)
   - Categorical: Discrete labels (for example, "Good", "Bad", "Neutral")
   - JSON: Free-form schemas for complex, structured results (for example, a compliance flag combined with a confidence score)

1. Optionally, select Enable Reasoning. This configures the LLM judge to provide a short justification for its decision (for example, why a score of 8 was given). Reasoning helps you understand how and why evaluations are made, and is particularly useful for auditing subjective metrics like tone, empathy, or helpfulness. Adding reasoning can also [make the LLM judge more accurate](https://arxiv.org/abs/2504.00050).

1. Edit the JSON schema that defines your evaluation's output type:

{% tab title="Boolean" %}
For the **Boolean** output type, edit the `description` field to further explain what true and false mean in your use case.
{% /tab %}

{% tab title="Score" %}
For the **Score** output type:

- Set a `min` and `max` score for your evaluation.
- Edit the `description` field to further explain the scale of your evaluation.

{% /tab %}

{% tab title="Categorical" %}
For the **Categorical** output type:

- Add or remove categories by editing the JSON schema.
- Edit category names.
- Edit the `description` field of categories to further explain what they mean in the context of your evaluation.

An example schema for a categorical evaluation:

```json
{
    "name": "categorical_eval",
    "schema": {
        "type": "object",
        "required": [
            "categorical_eval",
            "reasoning"
        ],
        "properties": {
            "categorical_eval": {
                "type": "string",
                "anyOf": [
                    {
                        "const": "budgeting_question",
                        "description": "The user is asking a question about their budget. The answer can be directly determined by looking at their budget and spending."
                    },
                    {
                        "const": "budgeting_request",
                        "description": "The user is asking to change something about their budget. This should involve an action that changes their budget."
                    },
                    {
                        "const": "budgeting_advice",
                        "description": "The user is asking for advice on their budget. This should not require a change to their budget, but it should require an analysis of their budget and spending."
                    },
                    {
                        "const": "general_financial_advice",
                        "description": "The user is asking for general financial advice which is not directly related to their specific budget. However, this can include advice about budgeting in general."
                    },
                    {
                        "const": "unrelated",
                        "description": "This is a catch-all category for things not related to budgeting or financial advice."
                    }
                ]
            },
            "reasoning": {
                "type": "string",
                "description": "Describe how you decided the category"
            }
        },
        "additionalProperties": false
    },
    "strict": true
}
```
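When iterating on a categorical schema, it can help to check sample judge responses locally before publishing. The following sketch assumes a trimmed copy of the schema above (descriptions omitted) and a hypothetical helper named `check_judge_output`. It uses only the Python standard library and is not a full JSON Schema validator.

```python
import json
from typing import List

# Trimmed copy of the categorical schema above (descriptions omitted).
SCHEMA = {
    "required": ["categorical_eval", "reasoning"],
    "properties": {
        "categorical_eval": {
            "anyOf": [
                {"const": "budgeting_question"},
                {"const": "budgeting_request"},
                {"const": "budgeting_advice"},
                {"const": "general_financial_advice"},
                {"const": "unrelated"},
            ]
        },
        "reasoning": {"type": "string"},
    },
}


def check_judge_output(raw: str, schema: dict = SCHEMA) -> List[str]:
    """Return a list of problems; an empty list means the output conforms."""
    output = json.loads(raw)
    problems = [f"missing field: {key}" for key in schema["required"] if key not in output]
    # "additionalProperties": false means unknown fields are not allowed.
    extra = set(output) - set(schema["properties"])
    problems += [f"unexpected field: {key}" for key in sorted(extra)]
    allowed = {option["const"] for option in schema["properties"]["categorical_eval"]["anyOf"]}
    label = output.get("categorical_eval")
    if label is not None and label not in allowed:
        problems.append(f"unknown category: {label}")
    return problems


sample = '{"categorical_eval": "budgeting_request", "reasoning": "The user asks to raise a budget."}'
print(check_judge_output(sample))
```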

{% /tab %}

{% tab title="JSON" %}
For the **JSON** output type, define a free form JSON schema to capture complex, structured evaluation outputs.

An example schema for a JSON evaluation:

```json
{
    "name": "json_eval",
    "schema": {
        "type": "object",
        "required": [
            "result",
            "reasoning"
        ],
        "properties": {
            "result": {
                "type": "object",
                "description": "The structured evaluation result",
                "properties": {
                    "is_compliant": {
                        "type": "boolean",
                        "description": "Whether the response meets compliance requirements"
                    },
                    "confidence_score": {
                        "type": "number",
                        "description": "Confidence level of the evaluation from 0 to 1"
                    },
                    "issue_count": {
                        "type": "integer",
                        "description": "Number of issues identified in the response"
                    }
                },
                "required": ["is_compliant", "confidence_score", "issue_count"],
                "additionalProperties": false
            },
            "reasoning": {
                "type": "string",
                "description": "Describe the reasoning behind your evaluation"
            }
        },
        "additionalProperties": false
    },
    "strict": true
}
```

{% /tab %}

1. Configure Assessment Criteria to map each evaluation result to a pass or fail state. This mapping lets you align evaluation outcomes with your team's quality bar. It also powers automation across Datadog Agent Observability, enabling monitors and dashboards to flag regressions or track overall health.

{% tab title="Boolean" %}
Select True to mark a result as "Pass", or False to mark a result as "Fail".
{% /tab %}

{% tab title="Score" %}
Define numerical thresholds to determine passing performance.
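
As a rough mental model, a score threshold maps each numeric result to a pass or fail state. The sketch below assumes an inclusive passing range. The function name `assess_score` and the inclusive bounds are assumptions for illustration, not a description of how Datadog evaluates thresholds.

```python
def assess_score(score: float, passing_min: float, passing_max: float) -> str:
    """Map a numeric judge score to "pass" or "fail" using an inclusive range."""
    return "pass" if passing_min <= score <= passing_max else "fail"


# On a 1-5 helpfulness scale where 3 and above is acceptable:
print(assess_score(4, passing_min=3, passing_max=5))
print(assess_score(2, passing_min=3, passing_max=5))
```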
{% /tab %}

{% tab title="Categorical" %}
Select the categories that should map to a passing state. For example, if you have the categories `Excellent`, `Good`, and `Poor`, where only `Poor` should correspond to a failing state, select `Excellent` and `Good`.
{% /tab %}

{% tab title="JSON" %}
Supply a JavaScript function that assigns an assessment based on the output from the LLM-as-a-judge evaluator. The function must return a JSON object in the following format:

```javascript
{
    assessment: "pass",                             // "pass" | "fail" [REQUIRED]
    value: "evaluation_label",                      // string [OPTIONAL]
    reasoning: "explanation behind the assessment"  // string [OPTIONAL]
}
```

The function signature must be `function __evalPostProcessing(input)`, where `input` is the JSON output from the evaluator. The following is an example of a post-processing function:

```javascript
function __evalPostProcessing(input) {
    /*
     * Expected input shape (from LLM evaluator [this depends on the JSON Structured Output]):
     * {
     *   criteria: {
     *     quality_score: { score: number (0–1), category: "excellent"|"good"|"poor", reasoning: string },
     *     toxicity:      { score: number (0–1), category: "safe"|"unsafe",           reasoning: string },
     *     completeness:  { score: number (0–1), category: "complete"|"incomplete",   reasoning: string },
     *     relevance:     { score: number (0–1), category: "relevant"|"irrelevant",   reasoning: string },
     *   },
     *   overall_reasoning: string  // (optional) top-level summary from LLM evaluator
     * }
     */

    const SCORE_THRESHOLD = 0.7;

    // Category → pass/fail mappings per criterion
    const CATEGORY_PASS_MAP = {
        quality_score: ["excellent", "good"],
        toxicity:      ["safe"],
        completeness:  ["complete"],
        relevance:     ["relevant"],
    };

    const criteriaResults = {};
    const failures = [];
    const passes = [];

    for (const [criterionName, passCategories] of Object.entries(CATEGORY_PASS_MAP)) {
        const criterion = input?.criteria?.[criterionName];

        if (!criterion) {
            failures.push(`[${criterionName}] Missing from evaluator output.`);
            criteriaResults[criterionName] = false;
            continue;
        }

        const { score, category, reasoning } = criterion;

        const scorePass    = typeof score === "number" && score >= SCORE_THRESHOLD;
        const categoryPass = typeof category === "string" && passCategories.includes(category.toLowerCase());

        // Both score AND category must pass
        const criterionPass = scorePass && categoryPass;
        criteriaResults[criterionName] = criterionPass;

        if (criterionPass) {
            passes.push(`[${criterionName}] PASS — score: ${score.toFixed(2)}, category: "${category}". ${reasoning ?? ""}`);
        } else {
            const reasons = [];
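            // Guard toFixed: a non-numeric score (for example, a string) would otherwise throw a TypeError
            if (!scorePass)    reasons.push(`score ${typeof score === "number" ? score.toFixed(2) : "N/A"} below threshold (≥${SCORE_THRESHOLD})`);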
            if (!categoryPass) reasons.push(`category "${category}" not in acceptable set [${passCategories.join(", ")}]`);
            failures.push(`[${criterionName}] FAIL — ${reasons.join("; ")}. ${reasoning ?? ""}`);
        }
    }

    // Determine overall assessment
    const passed = Object.values(criteriaResults).every(Boolean);
    const failCount = failures.length;

    const assessment = passed ? "pass" : "fail";

    const label = passed
        ? "high_quality_response"
        : failCount === 1
            ? "minor_quality_issue"
            : failCount === 2
                ? "moderate_quality_issue"
                : "low_quality_response";

    const reasoningParts = [
        passed
            ? "All criteria passed."
            : `${failCount} criterion/criteria failed.`,
        ...failures,
        ...passes,
        input?.overall_reasoning ? `Evaluator summary: ${input.overall_reasoning}` : ""
    ].filter(Boolean);

    return {
        assessment: assessment,
        value: label,
        reasoning: reasoningParts.join(" | ")
    };
}
```

{% /tab %}

{% /collapsible-section %}

{% collapsible-section open=null #post-processing %}
#### Post-Processing (OpenAI, Azure OpenAI, Anthropic, Amazon Bedrock, AI Gateway, Vertex AI)

1. Select the JSON output type.

1. Provide a JavaScript function to identify the evaluator's assessment, value, and reasoning. Post-processing enables you to conduct a more complex assessment than the Boolean, Score, or Categorical structured output types allow.

The post-processing function must return a JSON object containing an `assessment` of `"pass"` or `"fail"` and, optionally, `value` and `reasoning` strings, in the following format:

   ```javascript
   {
       assessment: "pass",                             // "pass" | "fail" [REQUIRED]
       value: "evaluation_label",                      // string [OPTIONAL]
       reasoning: "explanation behind the assessment"  // string [OPTIONAL]
   }
   ```

The function signature must be `function __evalPostProcessing(input)`, where `input` is the JSON output from the evaluator. The following is an example of a post-processing function:

   ```javascript
   function __evalPostProcessing(input) {
       /*
       * Expected input shape (from LLM evaluator [this depends on the JSON Structured Output]):
       * {
       *   criteria: {
       *     quality_score: { score: number (0–1), category: "excellent"|"good"|"poor", reasoning: string },
       *     toxicity:      { score: number (0–1), category: "safe"|"unsafe",           reasoning: string },
       *     completeness:  { score: number (0–1), category: "complete"|"incomplete",   reasoning: string },
       *     relevance:     { score: number (0–1), category: "relevant"|"irrelevant",   reasoning: string },
       *   },
       *   overall_reasoning: string  // (optional) top-level summary from LLM evaluator
       * }
       */
   
       const SCORE_THRESHOLD = 0.7;
   
       // Category → pass/fail mappings per criterion
       const CATEGORY_PASS_MAP = {
           quality_score: ["excellent", "good"],
           toxicity:      ["safe"],
           completeness:  ["complete"],
           relevance:     ["relevant"],
       };
   
       const criteriaResults = {};
       const failures = [];
       const passes = [];
   
       for (const [criterionName, passCategories] of Object.entries(CATEGORY_PASS_MAP)) {
           const criterion = input?.criteria?.[criterionName];
   
           if (!criterion) {
               failures.push(`[${criterionName}] Missing from evaluator output.`);
               criteriaResults[criterionName] = false;
               continue;
           }
   
           const { score, category, reasoning } = criterion;
   
           const scorePass    = typeof score === "number" && score >= SCORE_THRESHOLD;
           const categoryPass = typeof category === "string" && passCategories.includes(category.toLowerCase());
   
           // Both score AND category must pass
           const criterionPass = scorePass && categoryPass;
           criteriaResults[criterionName] = criterionPass;
   
           if (criterionPass) {
               passes.push(`[${criterionName}] PASS — score: ${score.toFixed(2)}, category: "${category}". ${reasoning ?? ""}`);
           } else {
               const reasons = [];
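               // Guard toFixed: a non-numeric score (for example, a string) would otherwise throw a TypeError
               if (!scorePass)    reasons.push(`score ${typeof score === "number" ? score.toFixed(2) : "N/A"} below threshold (≥${SCORE_THRESHOLD})`);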
               if (!categoryPass) reasons.push(`category "${category}" not in acceptable set [${passCategories.join(", ")}]`);
               failures.push(`[${criterionName}] FAIL — ${reasons.join("; ")}. ${reasoning ?? ""}`);
           }
       }
   
       // Determine overall assessment
       const passed = Object.values(criteriaResults).every(Boolean);
       const failCount = failures.length;
   
       const assessment = passed ? "pass" : "fail";
   
       const label = passed
           ? "high_quality_response"
           : failCount === 1
               ? "minor_quality_issue"
               : failCount === 2
                   ? "moderate_quality_issue"
                   : "low_quality_response";
   
       const reasoningParts = [
           passed
               ? "All criteria passed."
               : `${failCount} criterion/criteria failed.`,
           ...failures,
           ...passes,
           input?.overall_reasoning ? `Evaluator summary: ${input.overall_reasoning}` : ""
       ].filter(Boolean);
   
       return {
           assessment: assessment,
           value: label,
           reasoning: reasoningParts.join(" | ")
       };
   }
   ```

{% /collapsible-section %}

{% collapsible-section open=null #keyword-search-output %}
#### Keyword Search Output (Anthropic, Amazon Bedrock, AI Gateway)

1. Select the Boolean output type.
   {% alert level="info" %}
   For Keyword Search Output, only the **Boolean** output type is available.
   {% /alert %}
1. Provide True keywords and False keywords that define when the evaluation result is true or false, respectively.

Datadog searches the LLM-as-a-judge's response text for your defined keywords and provides the appropriate results for the evaluation. For this reason, you should instruct the LLM to respond with your chosen keywords.

For example, if you set:

   - True keywords: Yes, yes
   - False keywords: No, no

Then your system prompt should include something like `Respond with "yes" or "no"`.

1. For Assessment Criteria:

   - Select True to mark a result as "Pass"
   - Select False to mark a result as "Fail"

This flexibility allows you to align evaluation outcomes with your team's quality bar. Pass/fail mapping also powers automation across Datadog Agent Observability, enabling monitors and dashboards to flag regressions or track overall health.
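
To see why the prompt should constrain the judge to your chosen keywords, consider the following sketch of keyword matching. The exact matching rules Datadog applies (case sensitivity, whole-word matching, handling of conflicting matches) are not specified here, so the rules in this hypothetical `keyword_verdict` helper are assumptions.

```python
import re
from typing import Optional, Sequence


def keyword_verdict(
    response: str,
    true_keywords: Sequence[str],
    false_keywords: Sequence[str],
) -> Optional[bool]:
    """Return True or False when exactly one keyword set matches, else None."""

    def contains_any(keywords: Sequence[str]) -> bool:
        # Whole-word, case-sensitive match, so "no" does not match inside "not" or "know".
        return any(re.search(rf"\b{re.escape(k)}\b", response) for k in keywords)

    found_true = contains_any(true_keywords)
    found_false = contains_any(false_keywords)
    if found_true == found_false:
        return None  # Neither set matched, or both did: the response is ambiguous.
    return found_true


TRUE_KEYWORDS = ["Yes", "yes"]
FALSE_KEYWORDS = ["No", "no"]
print(keyword_verdict("yes", TRUE_KEYWORDS, FALSE_KEYWORDS))
print(keyword_verdict("Yes in tone, but no in substance.", TRUE_KEYWORDS, FALSE_KEYWORDS))
```

A free-form answer that mentions both keywords is ambiguous under this sketch, which is why a prompt instruction such as `Respond with "yes" or "no"` keeps results reliable.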

{% /collapsible-section %}

{% image
   source="https://docs.dd-static.net/images/llm_observability/evaluations/custom_llm_judge_5-2.5338b9c6c94dc911a3ab08afbbabdcb4.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/llm_observability/evaluations/custom_llm_judge_5-2.5338b9c6c94dc911a3ab08afbbabdcb4.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="Configuring the custom evaluation output under Structured Output, including reasoning and assessment criteria." /%}

### Define the evaluation scope: Filtering and sampling{% #define-the-evaluation-scope-filtering-and-sampling %}

{% alert level="info" %}
Span fields used in evaluations are limited to 250 KB each. Fields exceeding this size are truncated before being sent to the LLM judge.
{% /alert %}

Under Evaluation Scope, define where and how your evaluation runs. This helps control coverage (which spans or traces are included) and cost (how many are sampled).

- Application: Select the application you want to evaluate.
- Evaluate On: Choose one of the following:
  - Trace: Evaluate the full trace, including all its spans, as a single unit. Use this when the answer depends on context across multiple spans (agent goal completion, tool-use chains, RAG faithfulness). See [Trace-Level Evaluations](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/trace_level_evaluations.md) for examples and details on how trace completion is determined.
  - Span: Evaluate matching spans individually. Use the Query field to scope to specific spans (for example, only root spans, only `llm` spans, or spans with a specific tag).
  - Session: Evaluate an entire user session, including every trace and its spans, as a single unit. Use this when the answer depends on context across multiple traces in the same session (user satisfaction, multi-turn coherence, or user behavior over time). Requires spans tagged with a `session_id`. See [Session-Level Evaluations](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/session_level_evaluations.md) for examples and details on how session completion is determined.
- Query: (Optional) Enter a query using Datadog query syntax to filter which spans or traces are evaluated. For example:
  - `@name:agent.workflow` to filter by span name
  - `env:prod` to filter by tag
  - `@parent_id:undefined` to evaluate only root spans (when Evaluate On is set to Span)
  - `@name:agent.workflow AND env:prod` to filter by span name and tag
- Sampling Rate: (Optional) Apply sampling (for example, 10%) to control evaluation cost.
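
To reason about how the sampling rate affects cost, a back-of-the-envelope estimate can be sketched as follows. The span volume and per-evaluation token counts are hypothetical inputs, and the function names are illustrative only.

```python
def estimated_evaluations(matching_spans: int, sampling_rate_percent: float) -> int:
    """Approximate number of evaluations run for a given sampling rate."""
    if not 0 <= sampling_rate_percent <= 100:
        raise ValueError("sampling_rate_percent must be between 0 and 100")
    return round(matching_spans * sampling_rate_percent / 100)


def estimated_judge_tokens(evaluations: int, avg_prompt_tokens: int, avg_output_tokens: int) -> int:
    """Approximate LLM judge token usage for that many evaluations."""
    return evaluations * (avg_prompt_tokens + avg_output_tokens)


# Hypothetical workload: 200,000 matching spans per day, sampled at 10%.
daily_evaluations = estimated_evaluations(200_000, 10)
print(daily_evaluations)
print(estimated_judge_tokens(daily_evaluations, avg_prompt_tokens=1_500, avg_output_tokens=100))
```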

{% image
   source="https://docs.dd-static.net/images/llm_observability/evaluations/evaluation_scope_1.6a41ee0a408b760a525d819fd75f5f9a.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/llm_observability/evaluations/evaluation_scope_1.6a41ee0a408b760a525d819fd75f5f9a.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="Configuring the evaluation scope." /%}

### Test and preview{% #test-and-preview %}

The pane on the right shows Filtered Spans (or traces) corresponding to the configured evaluation scope.

Select a span to show JSON data available for use in an evaluation. Then, click Test Evaluation to pre-fill inputs to your evaluation with data from the span, and click Run to test.

## Viewing and using results{% #viewing-and-using-results %}

After you Save and Publish your evaluation, Datadog automatically runs your evaluation on targeted spans. Alternatively, you can Save as Draft and edit or enable your evaluation later.

Results are available across Agent Observability in near-real-time for published evaluations. You can find your custom LLM-as-a-judge results for a specific span in the Evaluations tab, alongside other evaluations.

{% image
   source="https://docs.dd-static.net/images/llm_observability/evaluations/custom_llm_judge_3-2.34af6339cac79cdac731158437a839a6.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/llm_observability/evaluations/custom_llm_judge_3-2.34af6339cac79cdac731158437a839a6.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="The Evaluations tab of a trace, displaying custom evaluation results alongside managed evaluations." /%}

Each evaluation result includes:

- The evaluated value (for example, `True`, `9`, or `Neutral`)
- The reasoning (when enabled)
- The pass/fail indicator (based on your assessment criteria)

Use the syntax `@evaluation.<evaluation_name>.value` to query or visualize results.

For example:

```
@evaluation.helpfulness-check.value
```

{% image
   source="https://docs.dd-static.net/images/llm_observability/evaluations/custom_llm_judge_4.159277f798924b9e8f1d971e7c1e532c.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/llm_observability/evaluations/custom_llm_judge_4.159277f798924b9e8f1d971e7c1e532c.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="The Agent Observability Traces view. In the search box, the user has entered `@evaluation.budget-guru-intent-classifier.value:budgeting_question` and results are populated below." /%}

You can:

- Filter traces by evaluation results (for example, `@evaluation.helpfulness-check.value`)
- Filter by pass/fail assessment status (for example, `@evaluation.helpfulness-check.assessment:fail`)
- Use evaluation results as [facets](https://docs.datadoghq.com/events/explorer/facets.md)
- View aggregate results in the Agent Observability Overview page's Evaluation section
- Create [monitors](https://docs.datadoghq.com/monitors.md) to alert on performance changes or regression
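
If you build these queries programmatically (for example, for dashboards or scripted searches), a small helper can keep the syntax consistent. The sketch below only assembles query strings in the `@evaluation.<evaluation_name>.<field>` form shown above. The helper name `evaluation_query` is hypothetical.

```python
from typing import Iterable, Optional


def evaluation_query(
    evaluation_name: str,
    field: str = "value",
    match: Optional[str] = None,
    extra_terms: Iterable[str] = (),
) -> str:
    """Build a search query for an evaluation result, optionally combined with other terms."""
    term = f"@evaluation.{evaluation_name}.{field}"
    if match is not None:
        term = f"{term}:{match}"
    return " AND ".join([term, *extra_terms])


print(evaluation_query("helpfulness-check"))
print(evaluation_query("helpfulness-check", field="assessment", match="fail", extra_terms=["env:prod"]))
```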

## Using in experiments{% #using-in-experiments %}

To reuse a custom LLM-as-a-judge evaluation in a local [LLM Experiment](https://docs.datadoghq.com/llm_observability/experiments.md), reference it by name using `RemoteEvaluator` from the SDK:

```python
from ddtrace.llmobs import LLMObs, RemoteEvaluator

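# Reference the published evaluation by the name you gave it in Datadog.
evaluator = RemoteEvaluator(eval_name="quality-assessment")

# `my_task` and `dataset` are assumed to be defined elsewhere in your experiment code.
experiment = LLMObs.experiment(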
    name="my-experiment",
    task=my_task,
    dataset=dataset,
    evaluators=[evaluator],
)
experiment.run()
```

You can mix `RemoteEvaluator` with other local evaluators in the same experiment. For custom input mapping, error handling, and more options, see [RemoteEvaluator](https://docs.datadoghq.com/llm_observability/guide/evaluation_developer_guide.md#using-managed-evaluators) in the Evaluation Developer Guide.

## Best practices for reliable custom evaluations{% #best-practices-for-reliable-custom-evaluations %}

- **Start small**: Target a single, well-defined failure mode before scaling.
- **Enable reasoning**: Turn it on when you need explainable decisions or want to improve accuracy on complex reasoning tasks.
- **Iterate**: Run, inspect outputs, and refine your prompt.
- **Validate**: Periodically check evaluator accuracy using sampled traces.
- **Document your rubric**: Clearly define what "Pass" and "Fail" mean to avoid drift over time.
- **Re-align your evaluator**: Reassess prompt and few-shot examples when the underlying LLM updates.

## Estimated token usage{% #estimated-token-usage %}

You can monitor the token usage of your LLM evaluations using the [LLM Evaluations Token Usage dashboard](https://app.datadoghq.com/dash/integration/llm_evaluations_token_usage).

If you need more details, the following metrics allow you to track the LLM resources consumed to power evaluations:

- `ml_obs.estimated_usage.llm.input.tokens`
- `ml_obs.estimated_usage.llm.output.tokens`
- `ml_obs.estimated_usage.llm.total.tokens`

Each of these metrics has `ml_app`, `model_server`, `model_provider`, `model_name`, and `evaluation_name` tags, allowing you to pinpoint specific applications, models, and evaluations contributing to your usage.
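
As an example of using these tags, the following sketch assembles a metric query string that filters and groups token usage. It assumes standard Datadog metric query syntax with a `sum` aggregator. The helper name `token_usage_query` and the `budget-guru` application name are hypothetical.

```python
from typing import Dict, Optional, Sequence


def token_usage_query(
    metric: str = "ml_obs.estimated_usage.llm.total.tokens",
    filters: Optional[Dict[str, str]] = None,
    group_by: Sequence[str] = (),
) -> str:
    """Build a metric query such as `sum:<metric>{tag:value} by {tag}`."""
    # Use `*` (all sources) when no tag filters are given.
    scope = ",".join(f"{key}:{value}" for key, value in sorted((filters or {}).items())) or "*"
    query = f"sum:{metric}{{{scope}}}"
    if group_by:
        query += " by {" + ",".join(group_by) + "}"
    return query


print(token_usage_query(filters={"ml_app": "budget-guru"}, group_by=["evaluation_name", "model_name"]))
```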

## Configure LLM-as-a-judge evaluations from the API{% #configure-llm-as-a-judge-evaluations-from-the-api %}

After you set your [API key](https://docs.datadoghq.com/account_management/api-app-keys.md) as `DD_API_KEY` in your environment, you can manage evaluation configurations with basic CRUD operations:

- [GET](https://docs.datadoghq.com/api/latest/llm-observability.md#get-a-custom-evaluator-configuration) existing evaluation configurations
- [PUT](https://docs.datadoghq.com/api/latest/llm-observability.md#create-or-update-a-custom-evaluator-configuration) to create or update evaluation configurations
- [DELETE](https://docs.datadoghq.com/api/latest/llm-observability.md#delete-a-custom-evaluator-configuration) existing evaluation configurations
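
The following sketch shows one way to prepare an authenticated request with the Python standard library. It deliberately takes the endpoint URL as an argument: copy the exact path and any additional required headers (such as an application key) from the linked API reference. The helper name `build_config_request` is hypothetical, and the only detail assumed here is the `DD-API-KEY` request header.

```python
import os
import urllib.request
from typing import Optional


def build_config_request(
    url: str,
    method: str = "GET",
    body: Optional[bytes] = None,
) -> urllib.request.Request:
    """Prepare (but do not send) a request authenticated with DD_API_KEY."""
    api_key = os.environ.get("DD_API_KEY")
    if not api_key:
        raise RuntimeError("Set the DD_API_KEY environment variable first")
    headers = {"DD-API-KEY": api_key, "Accept": "application/json"}
    if body is not None:
        # PUT requests send the evaluation configuration as a JSON body.
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=body, method=method, headers=headers)
```

To send the request, pass the returned object to `urllib.request.urlopen` and parse the response with `json.load`.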

## Further Reading{% #further-reading %}

- [Driving AI ROI: How Datadog connects cost, performance, and infrastructure so you can scale responsibly](https://www.datadoghq.com/blog/manage-ai-cost-and-performance-with-datadog/)
- [Gain visibility into Strands Agents workflows with Datadog LLM Observability](https://www.datadoghq.com/blog/llm-aws-strands)
- [Building an LLM evaluation framework: best practices](https://www.datadoghq.com/blog/llm-evaluation-framework-best-practices/)
- [Learn about Agent Observability terms and concepts](https://docs.datadoghq.com/llm_observability/terms.md)
- [Learn how to set up Agent Observability](https://docs.datadoghq.com/llm_observability/setup.md)
- [Learn about managed evaluations](https://docs.datadoghq.com/llm_observability/evaluations/managed_evaluations.md)
- [Using LLM-as-a-judge for an automated and versatile evaluation](https://huggingface.co/learn/cookbook/llm_judge)