---
title: Evaluation Developer Guide
description: Learn how to build custom evaluators using the LLM Observability SDK.
breadcrumbs: >-
  Docs > LLM Observability > LLM Observability Guides > Evaluation Developer
  Guide
---

# Evaluation Developer Guide

{% callout %}
# Important note for users on the following Datadog sites: app.ddog-gov.com

{% alert level="danger" %}
This product is not supported for your selected [Datadog site](https://docs.datadoghq.com/getting_started/site).
{% /alert %}

{% /callout %}

## Overview{% #overview %}

This guide covers how to build custom evaluators with the LLM Observability SDK and use them in LLM Experiments and in production.

## Key concepts{% #key-concepts %}

An **evaluation** measures a specific quality of your LLM application's output, such as accuracy, tone, or harmfulness. You write the evaluation logic inside an **evaluator**, which receives context about the LLM interaction and returns a result.

### Running evaluators in an Experiment{% #running-evaluators-in-an-experiment %}

To test your LLM application against a dataset before deploying, run your evaluators in [LLM Experiments](https://docs.datadoghq.com/llm_observability/experiments). In Experiments, evaluators run automatically: the SDK calls your evaluator on each dataset record. Use evaluators through the SDK.

### Running evaluators in production{% #running-evaluators-in-production %}

To monitor the quality of your live LLM responses, run evaluators in production. You can run evaluators manually with `submit_evaluation()`, or automatically with [custom LLM-as-a-judge evaluations](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations). Use evaluators through the SDK, HTTP API, or the Datadog UI.

For production, there are two approaches:

- **Manual evaluations** (this guide): You run evaluators in your application code and submit results with `LLMObs.submit_evaluation()` or the HTTP API. This gives you full control over evaluation logic and timing.
- **Custom LLM-as-a-judge evaluations**: You configure evaluations in the Datadog UI using natural language prompts. Datadog automatically runs them on production traces in real time, with no code changes required.

This guide focuses on manual evaluations. For managed LLM-as-a-judge evaluations, see [Custom LLM-as-a-Judge Evaluations](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations).

### Evaluation components{% #evaluation-components %}

The evaluation system has four main components:

- **EvaluatorContext**: The input to an evaluator. Contains the LLM's input, output, expected output, and span identifiers. In Experiments, the SDK builds this automatically from each dataset record. In production, you construct the EvaluatorContext yourself.
- **EvaluatorResult**: The output of an evaluator. Contains a typed value, optional reasoning, a pass/fail assessment, metadata, and tags. You can also return a plain value (`str`, `float`, `int`, `bool`, `dict`) instead.
- **Metric type**: Determines how the evaluation value is interpreted and displayed: `categorical` (string labels), `score` (numeric), `boolean` (pass/fail), or `json` (structured data).
- **SummaryEvaluatorContext**: Experiments only. After all dataset records are evaluated, summary evaluators receive the aggregated results to compute statistics like averages or pass rates.

The typical flow:

- **Experiments**: Dataset record → `EvaluatorContext` → Evaluator → `EvaluatorResult` → (after all records) `SummaryEvaluatorContext` → Summary evaluator → summary result
- **Production**: Span data → `EvaluatorContext` (built manually) → Evaluator → `EvaluatorResult` → `LLMObs.submit_evaluation()` or HTTP API

## Building evaluators{% #building-evaluators %}

There are two ways to define an evaluator using LLM Observability: class-based and function-based. In addition to these evaluators, LLM Observability has integrations with open source evaluation frameworks, such as [DeepEval](https://docs.datadoghq.com/llm_observability/evaluations/deepeval_evaluations/), that can be used in LLM Observability Experiments.

|                                 | Class-based                                                                                                     | Function-based                                                            |
| ------------------------------- | --------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
| **Best for**                    | Reusable evaluators with custom configuration or state.                                                         | One-off evaluators with straightforward logic.                            |
| **Receives**                    | An `EvaluatorContext` object with full span context (input, output, expected output, metadata, span/trace IDs). | `input_data`, `output_data`, and `expected_output` as separate arguments. |
| **Supports summary evaluators** | Yes (`BaseSummaryEvaluator`).                                                                                   | No.                                                                       |

If you are unsure, start with class-based evaluators. They provide everything function-based evaluators do, plus access to the full span context and support for summary evaluators.

### Class-based evaluators{% #class-based-evaluators %}

Class-based evaluators provide a structured way to implement reusable evaluation logic with custom configuration.

#### BaseEvaluator{% #baseevaluator %}

Subclass `BaseEvaluator` to create an evaluator that runs on a single span or dataset record. Implement the `evaluate` method, which receives an `EvaluatorContext` and returns an `EvaluatorResult` (or a plain value).

```python
from ddtrace.llmobs import BaseEvaluator, EvaluatorContext, EvaluatorResult

class SemanticSimilarityEvaluator(BaseEvaluator):
    """Evaluates semantic similarity between output and expected output."""

    def __init__(self, threshold: float = 0.8):
        super().__init__(name="semantic_similarity")
        self.threshold = threshold

    def evaluate(self, context: EvaluatorContext) -> EvaluatorResult:
        score = compute_similarity(context.output_data, context.expected_output)

        return EvaluatorResult(
            value=score,
            reasoning=f"Similarity score: {score:.2f}",
            assessment="pass" if score >= self.threshold else "fail",
            metadata={"threshold": self.threshold},
            tags={"type": "semantic"}
        )
```

- Call `super().__init__(name="evaluator_name")` to set the evaluator's label.
- Implement `evaluate(context: EvaluatorContext)` with your evaluation logic.
- Return an `EvaluatorResult` for rich results, or a plain value (`str`, `float`, `int`, `bool`, `dict`).

#### BaseSummaryEvaluator{% #basesummaryevaluator %}

{% alert level="info" %}
Summary evaluators are only available in experiments.
{% /alert %}

Subclass `BaseSummaryEvaluator` to create an evaluator that operates on the aggregated results of an entire experiment run. It receives a `SummaryEvaluatorContext` containing all inputs, outputs, and per-evaluator results.

```python
from ddtrace.llmobs import BaseSummaryEvaluator, SummaryEvaluatorContext

class AverageScoreEvaluator(BaseSummaryEvaluator):
    """Computes average score across all evaluation results."""

    def __init__(self, target_evaluator: str):
        super().__init__(name="average_score")
        self.target_evaluator = target_evaluator

    def evaluate(self, context: SummaryEvaluatorContext):
        scores = context.evaluation_results.get(self.target_evaluator, [])
        if not scores:
            return None
        return sum(scores) / len(scores)
```

- Call `super().__init__(name="evaluator_name")` to set the evaluator's label.
- Access per-evaluator results through `context.evaluation_results`, which maps evaluator names to lists of results.
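As an illustration of the aggregation step, the core logic of a pass-rate summary might look like the following. This is plain Python only (names are illustrative); inside a `BaseSummaryEvaluator`, you would call it with `context.evaluation_results`:

```python
def pass_rate(evaluation_results: dict, target_evaluator: str):
    """Illustrative aggregation: fraction of True results for one evaluator."""
    results = evaluation_results.get(target_evaluator, [])
    if not results:
        return None  # Nothing to aggregate
    passes = sum(1 for r in results if r is True)
    return passes / len(results)
```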

### LLMJudge{% #llmjudge %}

The `LLMJudge` class enables automated evaluation of LLM outputs using another LLM as the judge. It supports OpenAI, Azure OpenAI, Anthropic, Amazon Bedrock, and custom LLM clients with structured output formats.

#### Parameters{% #parameters %}

| Parameter           | Type               | Required    | Description                                                                                                      |
| ------------------- | ------------------ | ----------- | ---------------------------------------------------------------------------------------------------------------- |
| `user_prompt`       | `str`              | Yes         | Prompt template with `{{field.path}}` syntax for span context injection.                                         |
| `system_prompt`     | `str`              | No          | System prompt to set the judge's behavior or persona.                                                            |
| `structured_output` | `StructuredOutput` | No          | Output format specification. See structured output types.                                                        |
| `provider`          | `str`              | Conditional | LLM provider: `"openai"`, `"azure_openai"`, `"anthropic"`, or `"bedrock"`. Required if `client` is not provided. |
| `model`             | `str`              | No          | Model identifier (for example, `"gpt-4o"`, `"claude-sonnet-4-20250514"`).                                        |
| `model_params`      | `dict`             | No          | Additional parameters passed to the LLM API (for example, `temperature`).                                        |
| `client`            | callable           | Conditional | Custom LLM client function. Required if `provider` is not provided.                                              |
| `name`              | `str`              | No          | Evaluator name for identification in results.                                                                    |
| `client_options`    | `dict`             | No          | Provider-specific configuration (for example, API keys).                                                         |

#### Template variables{% #template-variables %}

The `user_prompt` supports `{{field.path}}` syntax to inject context from the evaluated span. Nested paths are supported.

- `{{input_data}}` — The span's input data.
- `{{output_data}}` — The span's output data.
- `{{expected_output}}` — Expected output for comparison (if available).
- `{{metadata.key}}` — Nested metadata fields (for example, `{{metadata.topic}}`).
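For example, a `user_prompt` that combines several of these variables might look like the following (`metadata.topic` is a hypothetical metadata field used for illustration):

```python
# Prompt template for an LLMJudge; variables are substituted from span context
user_prompt = (
    "Question: {{input_data}}\n"
    "Answer: {{output_data}}\n"
    "Reference answer: {{expected_output}}\n"
    "Topic: {{metadata.topic}}"  # hypothetical nested metadata field
)
```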

#### Structured output types{% #structured-output-types %}

| Output type                   | Description                                                               |
| ----------------------------- | ------------------------------------------------------------------------- |
| `BooleanStructuredOutput`     | Returns `True`/`False` with optional pass/fail assessment.                |
| `ScoreStructuredOutput`       | Returns a numeric score within a defined range, with optional thresholds. |
| `CategoricalStructuredOutput` | Returns one of a predefined set of categories, with optional pass values. |
| `Dict[str, JSONType]`         | Custom JSON schema for arbitrary structured output.                       |

All structured output types accept `reasoning=True` to include an explanation in results, and `reasoning_description` to customize the reasoning field's description.

#### Example: Boolean evaluation{% #example-boolean-evaluation %}

```python
from ddtrace.llmobs import LLMJudge, BooleanStructuredOutput

judge = LLMJudge(
    provider="openai",
    model="gpt-4o",
    user_prompt="Is this response factually accurate? Response: {{output_data}}",
    structured_output=BooleanStructuredOutput(
        description="Whether the response is factually accurate",
        reasoning=True,
        pass_when=True,
    ),
)
```

#### Example: Score-based evaluation with thresholds{% #example-score-based-evaluation-with-thresholds %}

```python
from ddtrace.llmobs import LLMJudge, ScoreStructuredOutput

judge = LLMJudge(
    provider="anthropic",
    model="claude-sonnet-4-20250514",
    user_prompt="Rate the helpfulness of this response (1-10): {{output_data}}",
    structured_output=ScoreStructuredOutput(
        description="Helpfulness score",
        min_score=1,
        max_score=10,
        reasoning=True,
        min_threshold=7,  # Scores >= 7 pass
    ),
)
```

#### Example: Categorical evaluation{% #example-categorical-evaluation %}

```python
from ddtrace.llmobs import LLMJudge, CategoricalStructuredOutput

judge = LLMJudge(
    provider="openai",
    model="gpt-4o",
    user_prompt="Classify the sentiment: {{output_data}}",
    structured_output=CategoricalStructuredOutput(
        categories={
            "positive": "The response has a positive sentiment.",
            "neutral": "The response has a neutral sentiment.",
            "negative": "The response has a negative sentiment.",
        },
        reasoning=True,
        pass_values=["positive", "neutral"],
    ),
)
```

#### Example: Azure OpenAI{% #example-azure-openai %}

```python
from ddtrace.llmobs import LLMJudge, BooleanStructuredOutput

judge = LLMJudge(
    provider="azure_openai",
    model="gpt-4o",
    user_prompt="Is this response factually accurate? Response: {{output_data}}",
    structured_output=BooleanStructuredOutput(
        description="Whether the response is factually accurate",
        reasoning=True,
        pass_when=True,
    ),
    client_options={
        "azure_endpoint": "https://your-resource.openai.azure.com",
        "api_version": "2024-10-21",
        "azure_deployment": "gpt-4o",
    },
)
```

The `azure_openai` provider accepts the following `client_options`:

| Option             | Environment variable       | Description                                           |
| ------------------ | -------------------------- | ----------------------------------------------------- |
| `api_key`          | `AZURE_OPENAI_API_KEY`     | Azure OpenAI API key.                                 |
| `azure_endpoint`   | `AZURE_OPENAI_ENDPOINT`    | Azure OpenAI endpoint URL.                            |
| `api_version`      | `AZURE_OPENAI_API_VERSION` | API version. Defaults to `"2024-10-21"`.              |
| `azure_deployment` | `AZURE_OPENAI_DEPLOYMENT`  | Deployment name. Falls back to the `model` parameter. |

#### Example: Custom LLM client{% #example-custom-llm-client %}

```python
from ddtrace.llmobs import LLMJudge, BooleanStructuredOutput

def my_llm_client(provider, messages, json_schema, model, model_params):
    # call_my_llm is a placeholder for your own LLM call;
    # json_schema describes the structured output the judge expects.
    response = call_my_llm(messages, model)
    return response

judge = LLMJudge(
    client=my_llm_client,
    model="my-custom-model",
    user_prompt="Is this response accurate? {{output_data}}",
    structured_output=BooleanStructuredOutput(
        description="Accuracy check",
        reasoning=True,
        pass_when=True,
    ),
)
```

#### Key points{% #key-points %}

- Requires either a `provider` (`"openai"`, `"azure_openai"`, `"anthropic"`, or `"bedrock"`) or a custom `client`.
- Set API keys using `client_options={"api_key": "..."}` or environment variables (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`). For Azure OpenAI, set `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT`. For Bedrock, configure AWS credentials through environment variables or `client_options`.
- Use `reasoning=True` in structured outputs to include an explanation in results.
- Define pass/fail criteria with `pass_when` (boolean), `pass_values` (categorical), or `min_threshold`/`max_threshold` (score).

#### Publishing an LLMJudge as a Datadog managed evaluation{% #publishing-an-llmjudge-as-a-datadog-managed-evaluation %}

Use `LLMObs.publish_evaluator()` to push a locally-defined `LLMJudge` configuration to Datadog as a custom LLM-as-a-judge draft. This lets you define and validate an evaluator in experiments, then promote it to production without manually recreating the configuration in the UI.

| Parameter          | Type             | Required | Description                                                                                                      |
| ------------------ | ---------------- | -------- | ---------------------------------------------------------------------------------------------------------------- |
| `evaluator`        | `LLMJudge`       | Yes      | The `LLMJudge` instance to publish.                                                                              |
| `ml_app`           | `str`            | Yes      | The LLM application name.                                                                                        |
| `eval_name`        | `str`            | No       | The name to use for the evaluator in Datadog. If omitted, defaults to the `name` set on the `LLMJudge` instance. |
| `variable_mapping` | `dict[str, str]` | No       | Remaps variable names in `user_prompt` to Datadog span field paths in the published evaluator.                   |

```python
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs._evaluators import BooleanStructuredOutput, LLMJudge

LLMObs.enable(
    ml_app="my-ml-app",
    api_key="<DD_API_KEY>",
    app_key="<DD_APP_KEY>",
)

judge = LLMJudge(
    provider="openai",
    model="gpt-4o",
    system_prompt="You are a helpful evaluator.",
    user_prompt=(
        "Does the output correctly answer the question?\n"
        "Input: {{input_data}}\n"
        "Output: {{output_data}}"
    ),
    structured_output=BooleanStructuredOutput("correctness", pass_when=True),
    name="my-correctness-judge",
)

result = LLMObs.publish_evaluator(
    judge,
    ml_app="my-ml-app",
    variable_mapping={"input_data": "span_input", "output_data": "span_output"},
)
print(result["ui_url"])
```

`LLMObs.publish_evaluator()` returns `{"ui_url": "..."}`, which links to the evaluator in Datadog.

{% alert level="info" %}
Each call to `LLMObs.publish_evaluator()` creates or updates the evaluator draft. Activate it from the Datadog UI to run it in production.
{% /alert %}

### Built-in evaluators{% #built-in-evaluators %}

The SDK provides built-in evaluators for common evaluation patterns. These are class-based evaluators that you can use directly without writing custom logic.

#### StringCheckEvaluator{% #stringcheckevaluator %}

Performs string comparison operations between `output_data` and `expected_output`.

| Operation   | Description                                                 |
| ----------- | ----------------------------------------------------------- |
| `eq`        | Exact match (default)                                       |
| `ne`        | Not equals                                                  |
| `contains`  | `output_data` contains `expected_output` (case-sensitive)   |
| `icontains` | `output_data` contains `expected_output` (case-insensitive) |

```python
from ddtrace.llmobs._evaluators import StringCheckEvaluator

# Perform an exact match (default)
evaluator = StringCheckEvaluator(operation="eq", case_sensitive=True)

# Check whether output_data contains expected_output (case-insensitive)
evaluator = StringCheckEvaluator(operation="icontains", strip_whitespace=True)

# Extract field from dict output before comparison
evaluator = StringCheckEvaluator(
    operation="eq",
    output_extractor=lambda x: x.get("message", "") if isinstance(x, dict) else str(x),
)
```

#### RegexMatchEvaluator{% #regexmatchevaluator %}

Validates output against a regex pattern.

| Match mode  | Description                                |
| ----------- | ------------------------------------------ |
| `search`    | Partial match anywhere in string (default) |
| `match`     | Match from start of string                 |
| `fullmatch` | Match entire string                        |

```python
from ddtrace.llmobs._evaluators import RegexMatchEvaluator
import re

# Validate email format
evaluator = RegexMatchEvaluator(
    pattern=r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
    match_mode="fullmatch"
)

# Validate output pattern (case-insensitive)
evaluator = RegexMatchEvaluator(
    pattern=r"success|completed",
    flags=re.IGNORECASE
)
```

#### LengthEvaluator{% #lengthevaluator %}

Validates output length constraints.

| Count type   | Description                |
| ------------ | -------------------------- |
| `characters` | Count characters (default) |
| `words`      | Count words                |
| `lines`      | Count lines                |

```python
from ddtrace.llmobs._evaluators import LengthEvaluator

# Ensure response is 50-200 characters
evaluator = LengthEvaluator(min_length=50, max_length=200, count_type="characters")

# Validate word count
evaluator = LengthEvaluator(min_length=10, max_length=100, count_type="words")
```

#### JSONEvaluator{% #jsonevaluator %}

Validates that output is valid JSON, and optionally checks for required keys.

```python
from ddtrace.llmobs._evaluators import JSONEvaluator

# Validate JSON syntax
evaluator = JSONEvaluator()

# Validate that required keys exist
evaluator = JSONEvaluator(required_keys=["name", "status", "data"])
```

#### SemanticSimilarityEvaluator{% #semanticsimilarityevaluator %}

Measures semantic similarity between `output_data` and `expected_output` using embeddings. Returns a similarity score between 0.0 and 1.0.

```python
from ddtrace.llmobs._evaluators import SemanticSimilarityEvaluator
from openai import OpenAI

client = OpenAI()

def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

evaluator = SemanticSimilarityEvaluator(
    embedding_fn=get_embedding,
    threshold=0.8  # Minimum similarity score to pass
)
```

### Function-based evaluators{% #function-based-evaluators %}

For straightforward evaluation logic, define a function instead of a class. Function-based evaluators receive the input, output, and expected output directly as arguments.

```python
from ddtrace.llmobs import EvaluatorResult

def exact_match_evaluator(input_data, output_data, expected_output):
    """Checks if output exactly matches expected output."""
    matches = output_data == expected_output
    return EvaluatorResult(
        value=matches,
        reasoning="Exact match" if matches else "Output differs from expected",
        assessment="pass" if matches else "fail",
    )
```

**Function signature**:

```python
def evaluator_function(
    input_data: Any,
    output_data: Any,
    expected_output: Any
) -> Union[JSONType, EvaluatorResult]:
    ...
```

You can return either:

- A plain value (`str`, `float`, `int`, `bool`, `dict`), or
- An `EvaluatorResult` for rich results with reasoning and metadata
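For instance, a function-based evaluator can return a plain numeric value directly, with no SDK types involved. The sketch below computes a hypothetical keyword-coverage score (the evaluator name and scoring rule are illustrative, not part of the SDK):

```python
def keyword_coverage_evaluator(input_data, output_data, expected_output):
    """Returns the fraction of expected keywords found in the output (illustrative)."""
    keywords = expected_output if isinstance(expected_output, list) else [expected_output]
    output_text = str(output_data).lower()
    found = sum(1 for kw in keywords if str(kw).lower() in output_text)
    return found / len(keywords) if keywords else 0.0
```

Because the return value is a plain `float`, it is submitted as a `score` metric without needing an `EvaluatorResult`.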

## Using evaluators in experiments{% #using-evaluators-in-experiments %}

Pass your evaluators to `LLMObs.experiment()` to run them against every record in a dataset. The SDK automatically builds an `EvaluatorContext` for each record and calls your evaluator. After all records are processed, any summary evaluators run on the aggregated results.

```python
from ddtrace.llmobs import LLMObs, Dataset, DatasetRecord

# Create dataset
dataset = Dataset(
    name="qa_dataset",
    records=[
        DatasetRecord(
            input_data={"question": "What is 2+2?"},
            expected_output="4"
        ),
        DatasetRecord(
            input_data={"question": "What is the capital of France?"},
            expected_output="Paris"
        ),
    ]
)

# Define task
def qa_task(input_data, config):
    return generate_answer(input_data["question"])  # generate_answer is your application logic

# Create evaluators
semantic_eval = SemanticSimilarityEvaluator(threshold=0.7)
summary_eval = AverageScoreEvaluator("semantic_similarity")

# Run experiment
experiment = LLMObs.experiment(
    name="qa_experiment",
    task=qa_task,
    dataset=dataset,
    evaluators=[semantic_eval, exact_match_evaluator],
    summary_evaluators=[summary_eval]
)

experiment.run()
```

### Using managed evaluators{% #using-managed-evaluators %}

`RemoteEvaluator` lets you reference a [custom LLM-as-a-judge evaluation](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations) configured in the Datadog UI by name, and run it as part of a local experiment. This allows you to reuse your production evaluators in offline experiments without reimplementing the evaluation logic in Python.

| Parameter      | Type                 | Description                                                                       |
| -------------- | -------------------- | --------------------------------------------------------------------------------- |
| `eval_name`    | `str`                | The name of the LLM-as-a-judge evaluator as configured in Datadog.                |
| `transform_fn` | `Optional[Callable]` | A function that maps an `EvaluatorContext` to a dict of template variable values. |

```python
from ddtrace.llmobs import LLMObs, RemoteEvaluator

evaluator = RemoteEvaluator(eval_name="quality-assessment")

experiment = LLMObs.experiment(
    name="my-experiment",
    task=my_task,
    dataset=dataset,
    evaluators=[evaluator],
)
experiment.run()
```

#### Mapping dataset data to prompt variables with `transform_fn`{% #mapping-dataset-data-to-prompt-variables-with-transform_fn %}

When you configure an LLM-as-a-judge in the Datadog UI, the [prompt template uses variables](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations#configure-the-prompt) such as `{{span_input}}` and `{{span_output}}`. By default, `RemoteEvaluator` maps the following:

- `input_data` → `span_input`
- `output_data` → `span_output`
- `expected_output` → `meta.expected_output`

If your dataset records have a different structure—for example, `input_data` is a dict with multiple keys—provide a `transform_fn` to control exactly which values are sent for each template variable:

```python
from ddtrace.llmobs import RemoteEvaluator, EvaluatorContext

def my_transform(context: EvaluatorContext) -> dict:
    # input_data is a dict: {"user_query": str, "retrieved_docs": list[str]}
    return {
        "span_input": context.input_data.get("user_query"),   # → {{span_input}} in the prompt
        "span_output": context.output_data,                   # → {{span_output}} in the prompt
        "meta": {
            "retrieved_docs": context.input_data.get("retrieved_docs"),  # → {{meta.retrieved_docs}}
        },
    }

evaluator = RemoteEvaluator(
    eval_name="quality-assessment",
    transform_fn=my_transform,
)
```

If the backend evaluator encounters an error, a `RemoteEvaluatorError` is raised. Inspect `backend_error` for details:

```python
from ddtrace.llmobs import RemoteEvaluator, RemoteEvaluatorError, EvaluatorContext

evaluator = RemoteEvaluator(eval_name="quality-assessment")
context = EvaluatorContext(input_data={"query": "What is the capital of France?"}, output_data="Paris")

try:
    result = evaluator.evaluate(context)
except RemoteEvaluatorError as e:
    print(e.backend_error)
    # {"type": "...", "message": "...", "recommended_resolution": "..."}
```

## Using evaluators in production{% #using-evaluators-in-production %}

{% alert level="info" %}
This section covers evaluations you run and submit manually from your application code. To have Datadog run evaluations automatically on production traces, see [Custom LLM-as-a-Judge Evaluations](https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations) instead.
{% /alert %}

To submit evaluations from your application code, construct the `EvaluatorContext` yourself, call the evaluator, and submit the result with `LLMObs.submit_evaluation()`. You can also submit evaluations through the HTTP API.

For the full `submit_evaluation()` arguments and span-joining options, see the [external evaluations documentation](https://docs.datadoghq.com/llm_observability/evaluations/external_evaluations). For the HTTP API specification, see the [Evaluations API reference](https://docs.datadoghq.com/llm_observability/instrumentation/api/#evaluations-api).

```python
from ddtrace.llmobs import LLMObs, EvaluatorContext
from ddtrace.llmobs.decorators import llm

evaluator = SemanticSimilarityEvaluator(threshold=0.8)

@llm(model_name="claude", name="invoke_llm", model_provider="anthropic")
def llm_call(input_text):
    completion = ...  # Your LLM application logic

    # Build the evaluation context from the span data
    context = EvaluatorContext(
        input_data=input_text,
        output_data=completion,
        expected_output=None,
    )

    # Run the evaluator
    result = evaluator.evaluate(context)

    # Submit the result to Datadog
    LLMObs.submit_evaluation(
        span=LLMObs.export_span(),
        ml_app="chatbot",
        label=evaluator.name,
        metric_type="score",
        value=result.value,
        assessment=result.assessment,
        reasoning=result.reasoning,
    )

    return completion
```

## Data model reference{% #data-model-reference %}

### EvaluatorContext{% #evaluatorcontext %}

A frozen dataclass containing all the information needed to run an evaluation.

| Field             | Type             | Description                                                        |
| ----------------- | ---------------- | ------------------------------------------------------------------ |
| `input_data`      | `Any`            | The input provided to the LLM application (for example, a prompt). |
| `output_data`     | `Any`            | The actual output from the LLM application.                        |
| `expected_output` | `Any`            | The expected or ideal output the LLM should have produced.         |
| `metadata`        | `Dict[str, Any]` | Additional metadata.                                               |
| `span_id`         | `str`            | The span's unique identifier.                                      |
| `trace_id`        | `str`            | The trace's unique identifier.                                     |

In Experiments, the SDK populates this automatically from each dataset record. In production, you construct it yourself from your span data.

### EvaluatorResult{% #evaluatorresult %}

Allows you to return rich evaluation results with additional context. Used in both Experiments and production.

| Field        | Type                                 | Description                                                              |
| ------------ | ------------------------------------ | ------------------------------------------------------------------------ |
| `value`      | `Union[str, float, int, bool, dict]` | The evaluation value. Type depends on `metric_type`.                     |
| `reasoning`  | `Optional[str]`                      | A text explanation of the evaluation result.                             |
| `assessment` | `Optional[str]`                      | An assessment of this evaluation. Accepted values are `pass` and `fail`. |
| `metadata`   | `Optional[Dict[str, Any]]`           | Additional metadata about the evaluation.                                |
| `tags`       | `Optional[Dict[str, str]]`           | Tags to apply to the evaluation metric.                                  |

### SummaryEvaluatorContext{% #summaryevaluatorcontext %}

A frozen dataclass providing aggregated evaluation results across all dataset records in an experiment. Only used by summary evaluators.

| Field                | Type                   | Description                                          |
| -------------------- | ---------------------- | ---------------------------------------------------- |
| `inputs`             | `List[Any]`            | List of all input data from the experiment.          |
| `outputs`            | `List[Any]`            | List of all output data from the experiment.         |
| `expected_outputs`   | `List[Any]`            | List of all expected outputs from the experiment.    |
| `evaluation_results` | `Dict[str, List[Any]]` | Dictionary mapping evaluator names to their results. |
| `metadata`           | `Dict[str, Any]`       | Additional metadata associated with the experiment.  |

### Metric types{% #metric-types %}

The metric type is set when submitting an evaluation (through `submit_evaluation()` or the HTTP API) and determines how the value is validated and displayed in Datadog.

| Metric type   | Value type       | Use case                                                                                   |
| ------------- | ---------------- | ------------------------------------------------------------------------------------------ |
| `categorical` | `str`            | Classifying outputs into categories (for example, "Positive", "Negative", "Neutral")       |
| `score`       | `float` or `int` | Numeric scores or ratings (for example, 0.0-1.0, 1-10)                                     |
| `boolean`     | `bool`           | Pass/fail or yes/no evaluations                                                            |
| `json`        | `dict`           | Structured evaluation data (for example, multi-dimensional rubrics or detailed breakdowns) |
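To make the table concrete, example values for each metric type might look like this (the values themselves are illustrative):

```python
# One example value per metric type
example_values = {
    "categorical": "Positive",                     # str label
    "score": 0.85,                                 # numeric rating
    "boolean": True,                               # pass/fail
    "json": {"accuracy": 0.9, "tone": "formal"},   # structured data
}
```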

## Best practices{% #best-practices %}

### Naming conventions{% #naming-conventions %}

Evaluation labels must follow these conventions:

- Must start with a letter
- Must only contain ASCII alphanumerics, underscores, or hyphens
- Spaces and other unsupported characters are converted to underscores
- Unicode is not supported
- Must not exceed 200 characters (fewer than 100 is preferred)
- Must be unique for a given LLM application (`ml_app`) and organization
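The character conversion described above can be sketched as follows. This is illustrative only; the SDK's actual normalization may differ in details:

```python
import re

def normalize_label(label: str) -> str:
    # Replace any character outside ASCII alphanumerics, underscores,
    # and hyphens (including spaces) with an underscore.
    return re.sub(r"[^A-Za-z0-9_-]", "_", label)
```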

### Concurrent execution{% #concurrent-execution %}

Set the `jobs` parameter to run tasks and evaluators concurrently on multiple threads, allowing experiments to complete faster when processing multiple dataset records.

{% alert level="info" %}
Asynchronous evaluators are not yet supported for concurrent execution. Only synchronous evaluators benefit from parallel execution.
{% /alert %}

### OpenTelemetry integration{% #opentelemetry-integration %}

When submitting evaluations for [OpenTelemetry-instrumented spans](https://docs.datadoghq.com/llm_observability/instrumentation/otel_instrumentation), include the `source:otel` tag in the evaluation. See the [external evaluations documentation](https://docs.datadoghq.com/llm_observability/evaluations/external_evaluations) for examples.

## Further Reading{% #further-reading %}

- [Learn about submitting external evaluations](https://docs.datadoghq.com/llm_observability/evaluations/external_evaluations)
- [Learn about the LLM Observability SDK for Python](https://docs.datadoghq.com/llm_observability/setup/sdk/python)
- [Learn about the HTTP API Reference](https://docs.datadoghq.com/llm_observability/instrumentation/api)
