Custom LLM-as-a-judge evaluations use an LLM to judge the performance of another LLM. You can define evaluation logic with natural language prompts, capture subjective or objective criteria (like tone, helpfulness, or factuality), and run these evaluations at scale across your traces and spans.
Create a custom LLM-as-a-judge evaluation
You can create and manage custom evaluations from the Evaluations page in LLM Observability.
In Datadog, navigate to the LLM Observability Evaluations page. Select Create Evaluation, then select Create your own.
Provide a clear, descriptive evaluation name (for example, factuality-check or tone-eval). You will use this name when querying evaluation results. The name must be unique within your application.
Use the Account drop-down menu to select the LLM provider and corresponding account to use for your LLM judge. To connect a new account, see connect an LLM provider.
Use the Model drop-down menu to select a model to use for your LLM judge.
In the Evaluation Prompt section, use the Prompt Template drop-down menu to select one of the following:
Create from scratch: Use your own custom prompt (defined in the next step).
Failure to Answer, Prompt Injection, Sentiment, etc.: Populate a pre-existing prompt template. You can use these templates as-is, or modify them to match your specific evaluation logic.
In the System Prompt field, enter your custom prompt or modify a prompt template.
For custom prompts, provide clear instructions describing what the evaluator should assess.
Focus on a single evaluation goal.
Include 2–3 few-shot examples showing input/output pairs, expected results, and reasoning.
Example custom prompt
System Prompt
You will be looking at interactions between a user and a budgeting AI agent. Your job is to classify the user's intent when it comes to using the budgeting AI agent.
You will be given a Span Input, which represents the user's message to the agent. Classify this message. Here are some examples.
Span Input: What are the core things I should know about budgeting?
Classification: general_financial_advice
Span Input: Did I go over budget with my grocery bills last month?
Classification: budgeting_question
Span Input: What is the category for which I have the highest budget?
Classification: budgeting_question
Span Input: Based on my past months, what is my ideal budget for subscriptions?
Classification: budgeting_advice
Span Input: Raise my restaurant budget by $50
Classification: budgeting_request
Span Input: Help me plan a trip to the Maldives
Classification: unrelated
User
Span Input: {{span_input}}
In the User field, provide your user prompt. Explicitly specify which parts of the span to evaluate: the Span Input ({{span_input}}), the Span Output ({{span_output}}), or both.
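For example, a minimal User prompt that hands both parts of the span to the judge might look like the following sketch (adapt it to the variables your evaluation actually needs):
Span Input: {{span_input}}
Span Output: {{span_output}}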
Select an evaluation output type:
Boolean: True/false results (for example, “Did the model follow instructions?”)
Score: Numeric ratings (for example, a 1–5 scale for helpfulness)
Categorical: Discrete labels (for example, “Good”, “Bad”, “Neutral”)
For Anthropic and Amazon Bedrock models, only the Boolean output type is available.
Define the structure of your output.
Edit a JSON schema that defines your evaluation's output type.
Boolean
Edit the description field to further explain what true and false mean in your use case.
Score
Set a min and max score for your evaluation.
Edit the description field to further explain the scale of your evaluation.
Categorical
Add or remove categories by editing the JSON schema.
Edit category names.
Edit the description field of each category to further explain what it means in the context of your evaluation.
An example schema for a categorical evaluation:
{"name":"categorical_eval","schema":{"type":"object","required":["categorical_eval"],"properties":{"categorical_eval":{"type":"string","anyOf":[{"const":"budgeting_question","description":"The user is asking a question about their budget. The answer can be directly determined by looking at their budget and spending."},{"const":"budgeting_request","description":"The user is asking to change something about their budget. This should involve an action that changes their budget."},{"const":"budgeting_advice","description":"The user is asking for advice on their budget. This should not require a change to their budget, but it should require an analysis of their budget and spending."},{"const":"general_financial_advice","description":"The user is asking for general financial advice which is not directly related to their specific budget. However, this can include advice about budgeting in general."},{"const":"unrelated","description":"This is a catch-all category for things not related to budgeting or financial advice."}]}},"additionalProperties":false},"strict":true}
Provide True keywords and False keywords that define when the evaluation result is true or false, respectively.
Datadog searches the LLM-as-a-judge's response text for your defined keywords and records the corresponding result for the evaluation. For this reason, you should instruct the LLM to respond with your chosen keywords.
For example, if you set:
True keywords: Yes, yes
False keywords: No, no
Then your system prompt should include something like Respond with "yes" or "no".
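For instance, a minimal keyword-based system prompt (a sketch only; adapt the criteria to your use case) could read:
You are judging whether the agent directly answered the user's question.
Read the Span Input and Span Output.
If the agent provided a direct, relevant answer, respond with "yes".
Otherwise, respond with "no".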
Under Evaluation Scope, define the scope of your evaluation:
Application: Select the application you want to evaluate.
Evaluate On: Choose one of the following:
Traces: Evaluate only root spans.
All Spans: Evaluate both root and child spans.
Span Names: (Optional) Limit evaluation to spans with certain names.
Tags: (Optional) Limit evaluation to spans with certain tags.
Sampling Rate: (Optional) Apply sampling (for example, 10%) to control evaluation cost.
Use the Test Evaluation panel on the right to preview how your evaluator performs. You can enter sample {{span_input}} and {{span_output}} values, then click Run Evaluation to see the LLM-as-a-judge's output before saving. Refine your evaluation until you are satisfied with the results.
Viewing and using results
After you save your evaluation, Datadog automatically runs it on the targeted spans. Results are available across LLM Observability in near real time. You can find your custom LLM-as-a-judge results for a specific span in the Evaluations tab, alongside all other evaluations.
Use the syntax @evaluations.custom.<evaluation_name> to query or visualize results.
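For example, if you created a Boolean evaluation named factuality-check (the name used earlier in this guide), a query along these lines surfaces spans that failed the check; the exact facet value depends on your evaluation's output type:
@evaluations.custom.factuality-check:false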