Custom LLM-as-a-Judge Evaluations


Custom LLM-as-a-judge evaluations use an LLM to judge the performance of another LLM. You can define evaluation logic with natural language prompts, capture subjective or objective criteria (like tone, helpfulness, or factuality), and run these evaluations at scale across your traces and spans.

Create a custom LLM-as-a-judge evaluation

You can create and manage custom evaluations from the Evaluations page in LLM Observability. You can start from scratch or build on one of the LLM-as-a-judge evaluation templates that Datadog provides.

Learn more about the compatibility requirements.

Configure the prompt

In Datadog, navigate to the LLM Observability Evaluations page. Select Create Evaluation, then select Create your own.
Provide a clear, descriptive evaluation name (for example, factuality-check or tone-eval). You can use this name when querying evaluation results. The name must be unique within your application.
Use the Account drop-down menu to select the LLM provider and corresponding account to use for your LLM judge. To connect a new account, see connect an LLM provider.
- If you select an Amazon Bedrock account, choose a region the account is configured for.
- If you select a Vertex account, choose a project and location.
Use the Model drop-down menu to select a model to use for your LLM judge.
Under Evaluation Scope, select the application you want to evaluate.
Under the Evaluation Prompt section, use the Prompt Template drop-down menu:
- Create from scratch: Use your own custom prompt (defined in the next step).
- Failure to Answer, Prompt Injection, Sentiment, etc.: Populate a pre-existing prompt template. You can use these templates as-is, or modify them to match your specific evaluation logic.
In the System Prompt field, enter your custom prompt or modify a prompt template. For custom prompts, provide clear instructions describing what the evaluator should assess.
- Focus on a single evaluation goal.
- Include 2–3 few-shot examples showing input/output pairs, expected results, and reasoning.

Example custom prompt

System Prompt

You will be looking at interactions between a user and a budgeting AI agent. Your job is to classify the user's intent when it comes to using the budgeting AI agent.

You will be given a Span Input, which represents the user's message to the agent, which you will then classify. Here are some examples.

Span Input: What are the core things I should know about budgeting?
Classification: general_financial_advice

Span Input: Did I go over budget with my grocery bills last month?
Classification: budgeting_question

Span Input: What is the category for which I have the highest budget?
Classification: budgeting_question

Span Input: Based on my past months, what is my ideal budget for subscriptions?
Classification: budgeting_advice

Span Input: Raise my restaurant budget by $50
Classification: budgeting_request

Span Input: Help me plan a trip to the Maldives
Classification: unrelated

User

Span Input: {{span_input}}

In the User field, provide your user prompt. Explicitly specify what parts of the span to evaluate. You can reference any span attribute, such as Span Input ({{span_input}}), Output ({{span_output}}), or any other span field. An autocomplete dropdown appears when you type {{ to help you select available fields.
Additional variables are available: type {{ to see the full list. You may also use Filtered Spans or Filtered Traces (on the right side) to add span data as a variable:
1. Choose an account and an application so that spans/traces show up on the right.
2. Select one of the spans on the right to view its JSON.
3. Use the three-dots menu and select Add variable to message to insert the JSON into your prompt.

The menu contents of the JSON view in the custom evaluation configuration right pane, displaying the option to Add variable to message.
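
To make the template variables described above more concrete, here is a minimal Python sketch of how a `{{variable}}` placeholder could be filled in from span data. Datadog performs this substitution for you; the `render` function, the regular expression, and the sample span values are all hypothetical and only illustrate the idea.

```python
import re

# Hypothetical span data; the keys mirror the variables named in this section.
span = {
    "span_input": "Did I go over budget with my grocery bills last month?",
    "span_output": "You spent $412 of your $400 grocery budget.",
}

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, values: dict) -> str:
    """Replace each {{name}} with values[name]; leave unknown names untouched."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return str(values.get(name, match.group(0)))

    return PLACEHOLDER.sub(substitute, template)


user_prompt = render("Span Input: {{span_input}}", span)
print(user_prompt)
```

Leaving unknown placeholders untouched, as this sketch does, makes a misspelled variable name easy to spot when you preview the prompt.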

Define the evaluation output

For OpenAI, Azure OpenAI, Vertex AI, Anthropic, or Amazon Bedrock models, configure Structured Output.

For Anthropic or Amazon Bedrock models, you can alternatively configure Keyword Search Output.

For AI Gateway, both Structured Output and Keyword Search Output are supported. Datadog recommends using Structured Output when your model supports it, and falling back to Keyword Search Output otherwise.

Structured Output (OpenAI, Azure OpenAI, Anthropic, Amazon Bedrock, AI Gateway, Vertex AI)

Select an evaluation output type:
- Boolean: True/false results (for example, “Did the model follow instructions?”)
- Score: Numeric ratings (for example, a 1–5 scale for helpfulness)
- Categorical: Discrete labels (for example, “Good”, “Bad”, “Neutral”)
- JSON: Free-form schemas for complex, structured results
Optionally, select Enable Reasoning. This configures the LLM judge to provide a short justification for its decision (for example, why a score of 8 was given). Reasoning helps you understand how and why evaluations are made, and is particularly useful for auditing subjective metrics like tone, empathy, or helpfulness. Adding reasoning can also make the LLM judge more accurate.
Edit the JSON schema that defines your evaluation’s output type:

For the Boolean output type, edit the description field to further explain what true and false mean in your use case.

For the Score output type:

Set a min and max score for your evaluation.
Edit the description field to further explain the scale of your evaluation.

For the Categorical output type:

Add or remove categories by editing the JSON schema.
Edit category names.
Edit the description field of categories to further explain what they mean in the context of your evaluation.

An example schema for a categorical evaluation:

{
    "name": "categorical_eval",
    "schema": {
        "type": "object",
        "required": [
            "categorical_eval",
            "reasoning"
        ],
        "properties": {
            "categorical_eval": {
                "type": "string",
                "anyOf": [
                    {
                        "const": "budgeting_question",
                        "description": "The user is asking a question about their budget. The answer can be directly determined by looking at their budget and spending."
                    },
                    {
                        "const": "budgeting_request",
                        "description": "The user is asking to change something about their budget. This should involve an action that changes their budget."
                    },
                    {
                        "const": "budgeting_advice",
                        "description": "The user is asking for advice on their budget. This should not require a change to their budget, but it should require an analysis of their budget and spending."
                    },
                    {
                        "const": "general_financial_advice",
                        "description": "The user is asking for general financial advice which is not directly related to their specific budget. However, this can include advice about budgeting in general."
                    },
                    {
                        "const": "unrelated",
                        "description": "This is a catch-all category for things not related to budgeting or financial advice."
                    }
                ]
            },
            "reasoning": {
                "type": "string",
                "description": "Describe how you decided the category"
            }
        },
        "additionalProperties": false
    },
    "strict": true
}
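
As a rough illustration of what this schema constrains, the following Python sketch checks a judge response against a trimmed copy of the schema above, using only the standard library. The `check_response` helper and both sample responses are hypothetical; with Structured Output, the model provider enforces the schema, so this only shows what "conforming" means.

```python
import json

# Trimmed copy of the categorical schema above (descriptions omitted).
schema = {
    "type": "object",
    "required": ["categorical_eval", "reasoning"],
    "properties": {
        "categorical_eval": {
            "type": "string",
            "anyOf": [
                {"const": "budgeting_question"},
                {"const": "budgeting_request"},
                {"const": "budgeting_advice"},
                {"const": "general_financial_advice"},
                {"const": "unrelated"},
            ],
        },
        "reasoning": {"type": "string"},
    },
    "additionalProperties": False,
}


def check_response(raw: str, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the response conforms."""
    data = json.loads(raw)
    problems = []
    for key in schema["required"]:
        if key not in data:
            problems.append(f"missing required field: {key}")
    for key in data:
        if key not in schema["properties"]:
            problems.append(f"unexpected field: {key}")
    options = schema["properties"]["categorical_eval"]["anyOf"]
    allowed = {option["const"] for option in options}
    label = data.get("categorical_eval")
    if label is not None and label not in allowed:
        problems.append(f"unknown category: {label}")
    return problems


# Hypothetical judge responses.
good = '{"categorical_eval": "budgeting_request", "reasoning": "The user asks for a change."}'
bad = '{"categorical_eval": "travel"}'
print(check_response(good, schema))
print(check_response(bad, schema))
```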

For the JSON output type, define a free-form JSON schema to capture complex, structured evaluation outputs.

An example schema for a JSON evaluation:

{
    "name": "json_eval",
    "schema": {
        "type": "object",
        "required": [
            "result",
            "reasoning"
        ],
        "properties": {
            "result": {
                "type": "object",
                "description": "The structured evaluation result",
                "properties": {
                    "is_compliant": {
                        "type": "boolean",
                        "description": "Whether the response meets compliance requirements"
                    },
                    "confidence_score": {
                        "type": "number",
                        "description": "Confidence level of the evaluation from 0 to 1"
                    },
                    "issue_count": {
                        "type": "integer",
                        "description": "Number of issues identified in the response"
                    }
                },
                "required": ["is_compliant", "confidence_score", "issue_count"],
                "additionalProperties": false
            },
            "reasoning": {
                "type": "string",
                "description": "Describe the reasoning behind your evaluation"
            }
        },
        "additionalProperties": false
    },
    "strict": true
}

Configure Assessment Criteria to map each evaluation result to a pass or fail state. This mapping lets you align evaluation outcomes with your team’s quality bar. It also powers automation across Datadog LLM Observability, enabling monitors and dashboards to flag regressions or track overall health.

For the Boolean output type, select True to mark a result as “Pass”, or False to mark a result as “Fail”.

For the Score output type, define numerical thresholds to determine passing performance.

For the Categorical output type, select the categories that should map to a passing state. For example, if you have the categories Excellent, Good, and Poor, where only Poor should correspond to a failing state, select Excellent and Good.

Assessment Criteria is not currently available for JSON evaluations.
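
The pass/fail mapping described above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and treating the score thresholds as an inclusive range is an assumption, not a statement about how Datadog evaluates them.

```python
def assess_boolean(value: bool, passing_value: bool = True) -> str:
    """Pass when the result equals the value selected as passing."""
    return "pass" if value == passing_value else "fail"


def assess_score(value: float, min_pass: float, max_pass: float) -> str:
    """Pass when the score lies inside the passing range (assumed inclusive)."""
    return "pass" if min_pass <= value <= max_pass else "fail"


def assess_category(value: str, passing: set[str]) -> str:
    """Pass when the label is one of the categories selected as passing."""
    return "pass" if value in passing else "fail"


print(assess_boolean(True))
print(assess_score(3, min_pass=4, max_pass=5))
print(assess_category("Good", {"Excellent", "Good"}))
```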

Keyword Search Output (Anthropic, Amazon Bedrock, AI Gateway)

Select the Boolean output type.
For Keyword Search Output, only the Boolean output type is available.
Provide True keywords and False keywords that define when the evaluation result is true or false, respectively.
Datadog searches the LLM-as-a-judge’s response text for your defined keywords and provides the appropriate results for the evaluation. For this reason, you should instruct the LLM to respond with your chosen keywords.
For example, if you set:
- True keywords: Yes, yes
- False keywords: No, no
Then your system prompt should include something like Respond with "yes" or "no".
For Assessment Criteria:
- Select True to mark a result as “Pass”
- Select False to mark a result as “Fail”
This flexibility allows you to align evaluation outcomes with your team’s quality bar. Pass/fail mapping also powers automation across Datadog LLM Observability, enabling monitors and dashboards to flag regressions or track overall health.
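
The keyword search described above can be approximated with the Python sketch below. The exact matching rules Datadog applies are not stated here, so whole-word matching, case sensitivity, and the "earliest keyword wins" tie-break are all assumptions, and `keyword_result` is a hypothetical helper.

```python
import re
from typing import Optional


def keyword_result(
    response: str, true_keywords: list[str], false_keywords: list[str]
) -> Optional[bool]:
    """Return True or False for whichever keyword group appears first, else None."""

    def first_position(keywords: list[str]) -> Optional[int]:
        positions = []
        for keyword in keywords:
            # Whole-word, case-sensitive match (an assumption of this sketch).
            match = re.search(rf"\b{re.escape(keyword)}\b", response)
            if match:
                positions.append(match.start())
        return min(positions) if positions else None

    true_pos = first_position(true_keywords)
    false_pos = first_position(false_keywords)
    if true_pos is None and false_pos is None:
        return None  # no keyword found; the result is undetermined
    if false_pos is None or (true_pos is not None and true_pos < false_pos):
        return True
    return False


print(keyword_result("Yes, the answer is grounded.", ["Yes", "yes"], ["No", "no"]))
```

A judge that answers in free-form sentences can contain neither keyword or both, which is why the prompt should instruct it to respond with only the chosen keywords.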

Configuring the custom evaluation output under Structured Output, including reasoning and assessment criteria.

Define the evaluation scope: Filtering and sampling

Under Evaluation Scope, define where and how your evaluation runs. This helps control coverage (which spans are included) and cost (how many spans are sampled).

Application: Select the application you want to evaluate.
Evaluate On: Choose one of the following:
- Traces: Evaluate only root spans
- All Spans: Evaluate both root and child spans
Span Names: (Optional) Limit evaluation to spans with certain names.
Tags: (Optional) Limit evaluation to spans with certain tags.
Sampling Rate: (Optional) Apply sampling (for example, 10%) to control evaluation cost.
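
To show how these scope settings combine, here is a hedged Python sketch that filters spans and then samples them. Everything in it is hypothetical: the span fields (`span_id`, `parent_id`, `name`, `tags`), the `evaluate_on` values, and the hash-based sampling are assumptions made for illustration, not Datadog's implementation.

```python
import hashlib


def in_scope(span: dict, span_names: set[str], required_tags: set[str], evaluate_on: str) -> bool:
    """Apply the Evaluate On, Span Names, and Tags filters to one span."""
    if evaluate_on == "traces" and span.get("parent_id") is not None:
        return False  # "Traces" evaluates root spans only
    if span_names and span["name"] not in span_names:
        return False
    return required_tags <= set(span.get("tags", []))


def sampled(span_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of spans, keyed on the span ID."""
    digest = hashlib.sha256(span_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


root = {"span_id": "a1", "parent_id": None, "name": "agent.run", "tags": ["env:prod"]}
child = {"span_id": "b2", "parent_id": "a1", "name": "llm.call", "tags": ["env:prod"]}

for span in (root, child):
    selected = in_scope(span, set(), {"env:prod"}, "traces") and sampled(span["span_id"], 1.0)
    print(span["name"], selected)
```

Filtering narrows coverage before sampling is applied, so a 10% sampling rate in this sketch means 10% of the spans that already match the filters.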

Test and preview

The pane on the right shows Filtered Spans (or traces) corresponding to the configured evaluation scope.

Select a span to show JSON data available for use in an evaluation. Then, click Test Evaluation to pre-fill inputs to your evaluation with data from the span, and click Run to test.

Viewing and using results

After you Save and Publish your evaluation, Datadog automatically runs your evaluation on targeted spans. Alternatively, you can Save as Draft and edit or enable your evaluation later.

Results are available across LLM Observability in near-real-time for published evaluations. You can find your custom LLM-as-a-judge results for a specific span in the Evaluations tab, alongside other evaluations.

The Evaluations tab of a trace, displaying custom evaluation results alongside managed evaluations.

Each evaluation result includes:

The evaluated value (for example, True, 9, or Neutral)
The reasoning (when enabled)
The pass/fail indicator (based on your assessment criteria)

Use the syntax @evaluations.custom.<evaluation_name> to query or visualize results.

For example:

@evaluations.custom.helpfulness-check

The LLM Observability Traces view. In the search box, the user has entered `@evaluations.custom.budget-guru-intent-classifier:budgeting_question` and results are populated below.

You can:

Filter traces by evaluation results (for example, @evaluations.custom.helpfulness-check)
Filter by pass/fail assessment status (for example, @evaluations.assessment.custom.helpfulness-check:fail)
Use evaluation results as facets
View aggregate results in the LLM Observability Overview page’s Evaluation section
Create monitors to alert on performance changes or regressions
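
If you build these search queries programmatically, a small helper can keep the syntax consistent. The Python sketch below assembles the two query forms shown on this page; the helper names are hypothetical, and `pass` as an assessment status is assumed by symmetry with the `fail` example.

```python
from typing import Optional


def result_query(evaluation_name: str, value: Optional[str] = None) -> str:
    """Build a query on an evaluation's result, optionally pinned to one value."""
    query = f"@evaluations.custom.{evaluation_name}"
    return f"{query}:{value}" if value is not None else query


def assessment_query(evaluation_name: str, status: str) -> str:
    """Build a query on an evaluation's pass/fail assessment status."""
    if status not in {"pass", "fail"}:  # "pass" is an assumption of this sketch
        raise ValueError(f"unsupported status: {status}")
    return f"@evaluations.assessment.custom.{evaluation_name}:{status}"


print(result_query("budget-guru-intent-classifier", "budgeting_question"))
print(assessment_query("helpfulness-check", "fail"))
```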

Best practices for reliable custom evaluations

Start small: Target a single, well-defined failure mode before scaling.
Enable reasoning: Use it when you need explainable decisions or want to improve accuracy on complex reasoning tasks.
Iterate: Run, inspect outputs, and refine your prompt.
Validate: Periodically check evaluator accuracy using sampled traces.
Document your rubric: Clearly define what “Pass” and “Fail” mean to avoid drift over time.
Re-align your evaluator: Reassess prompt and few-shot examples when the underlying LLM updates.
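
One way to carry out the validation step above is to compare the judge's labels with human labels on a small sample of traces. The Python sketch below computes a simple agreement rate and lists the disagreements; the function names and the sample labels are hypothetical, and how you export results from Datadog is outside its scope.

```python
from collections import Counter


def agreement(judge: list[str], human: list[str]) -> float:
    """Fraction of sampled spans where the judge and the human reviewer agree."""
    if not judge or len(judge) != len(human):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def disagreements(judge: list[str], human: list[str]) -> Counter:
    """Count (human_label, judge_label) pairs where the two differ."""
    return Counter((h, j) for j, h in zip(judge, human) if j != h)


# Hypothetical labels for four sampled spans.
judge_labels = ["budgeting_question", "budgeting_request", "unrelated", "budgeting_advice"]
human_labels = ["budgeting_question", "budgeting_request", "unrelated", "budgeting_question"]

print(agreement(judge_labels, human_labels))
print(disagreements(judge_labels, human_labels))
```

Recurring disagreement pairs point to the categories whose descriptions or few-shot examples need refinement.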