---
title: External Evaluations
description: Submit your own evaluations to LLM Observability with the SDK or the Evaluations API.
breadcrumbs: Docs > LLM Observability > Evaluations > External Evaluations
---

# External Evaluations

{% callout %}
# Important note for users on the following Datadog site: app.ddog-gov.com

{% alert level="danger" %}
This product is not supported for your selected [Datadog site](https://docs.datadoghq.com/getting_started/site).
{% /alert %}

{% /callout %}

## Overview{% #overview %}

In the context of LLM applications, it's important to track user feedback and evaluate the quality of your LLM application's responses. While LLM Observability provides several out-of-the-box evaluations for your traces, you can also submit your own external evaluations in two ways: with Datadog's SDK or with the LLM Observability API. When naming evaluation labels, follow these conventions:

- Evaluation labels must start with a letter.
- Evaluation labels must only contain ASCII alphanumerics or underscores.
  - Other characters, including spaces, are converted to underscores.
  - Unicode is not supported.
- Evaluation labels must not exceed 200 characters. Labels under 100 characters display best in the UI.

{% alert level="info" %}
Evaluation labels must be unique for a given LLM application (`ml_app`) and organization.
{% /alert %}
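As an illustration of these rules, a label can be pre-validated client-side before submission. The helper below is a hypothetical sketch (not part of the SDK) that mirrors the documented conventions:

```python
import re


def normalize_label(label: str, max_len: int = 200) -> str:
    """Normalize an evaluation label per the documented rules (illustrative only)."""
    # Replace any character that is not an ASCII alphanumeric or underscore;
    # this also replaces spaces and any Unicode characters.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", label)
    # Labels must start with a letter
    if not cleaned or not cleaned[0].isalpha():
        raise ValueError("Evaluation labels must start with a letter")
    # Enforce the 200-character limit
    return cleaned[:max_len]
```

For example, `normalize_label("harmfulness score")` yields `harmfulness_score`.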

## Submitting external evaluations with the SDK{% #submitting-external-evaluations-with-the-sdk %}

The LLM Observability SDK provides the methods `LLMObs.submit_evaluation()` and `LLMObs.export_span()` to help your traced LLM application submit external evaluations to LLM Observability. See the [Python](https://docs.datadoghq.com/llm_observability/setup/sdk/python/#evaluations) or [Node.js](https://docs.datadoghq.com/llm_observability/setup/sdk/nodejs/#evaluations) SDK documentation for more details.

{% alert level="info" %}
For building reusable, class-based evaluators with rich result metadata, see the [Evaluation Developer Guide](https://docs.datadoghq.com/llm_observability/guide/evaluation_developer_guide/).
{% /alert %}

### Example{% #example %}

```python
from typing import Any

from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import llm


def my_harmfulness_eval(input: Any) -> float:
    score = ...  # custom harmfulness evaluation logic
    return score


@llm(model_name="claude", name="invoke_llm", model_provider="anthropic")
def llm_call():
    completion = ...  # user application logic to invoke the LLM

    # Join an evaluation to a span via its span ID and trace ID
    span_context = LLMObs.export_span(span=None)  # None exports the current active span
    LLMObs.submit_evaluation(
        span=span_context,
        ml_app="chatbot",
        label="harmfulness",
        metric_type="score",  # can be "score" or "categorical"
        value=my_harmfulness_eval(completion),
        tags={"type": "custom"},
        timestamp_ms=1765990800016,  # optional, Unix timestamp in milliseconds
        assessment="pass",  # optional, "pass" or "fail"
        reasoning="it makes sense",  # optional, judge LLM reasoning
    )
```

## Submitting external evaluations with the API{% #submitting-external-evaluations-with-the-api %}

You can use the evaluations API provided by LLM Observability to send evaluations associated with spans to Datadog. See the [Evaluations API](https://docs.datadoghq.com/llm_observability/setup/api/?tab=model#evaluations-api) for more details on the API specifications. For building reusable evaluators, see the [Evaluation Developer Guide](https://docs.datadoghq.com/llm_observability/guide/evaluation_developer_guide).

To submit evaluations for [OpenTelemetry spans](https://docs.datadoghq.com/llm_observability/instrumentation/otel_instrumentation) directly to the Evaluations API, you must include the `source:otel` tag in the evaluation. Additionally, `span_id` and `trace_id` values must be provided as **decimal** strings. If your OpenTelemetry instrumentation produces hexadecimal IDs, convert them to decimal before submitting. For example, in Python: `str(int(hex_span_id, 16))`.
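The hex-to-decimal conversion mentioned above can be wrapped in a small helper. This is an illustrative snippet, not part of the SDK:

```python
def otel_id_to_decimal(hex_id: str) -> str:
    """Convert an OpenTelemetry hex span/trace ID to the decimal string the API expects."""
    return str(int(hex_id, 16))
```

For example, `otel_id_to_decimal("ff")` returns `"255"`; apply the same conversion to both `span_id` and `trace_id` before building the request body.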

### Example{% #example-1 %}

The `source:otel` tag is included because this example joins the evaluation to an OpenTelemetry span; omit it for spans created by the Datadog SDK.

```json
{
  "data": {
    "type": "evaluation_metric",
    "id": "456f4567-e89b-12d3-a456-426655440000",
    "attributes": {
      "metrics": [
        {
          "id": "cdfc4fc7-e2f6-4149-9c35-edc4bbf7b525",
          "join_on": {
            "tag": {
              "key": "msg_id",
              "value": "1123132"
            }
          },
          "span_id": "20245611112024561111",
          "trace_id": "13932955089405749200",
          "ml_app": "weather-bot",
          "timestamp_ms": 1765990800016,
          "metric_type": "score",
          "label": "Accuracy",
          "score_value": 3,
          "tags": ["source:otel"],
          "assessment": "pass",
          "reasoning": "it makes sense"
        }
      ]
    }
  }
}
```
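A payload like the one above can also be built programmatically before posting it to the Evaluations API. The sketch below builds a minimal payload; the endpoint path and headers shown in the trailing comment are assumptions, so confirm them against the Evaluations API reference linked above:

```python
import json
import uuid


def build_evaluation_payload(span_id: str, trace_id: str, ml_app: str,
                             label: str, score: float, timestamp_ms: int) -> dict:
    """Build a minimal evaluation-metric payload (sketch; see the API docs for the full schema)."""
    return {
        "data": {
            "type": "evaluation_metric",
            "attributes": {
                "metrics": [
                    {
                        "id": str(uuid.uuid4()),  # unique ID for this metric
                        "span_id": span_id,       # decimal string
                        "trace_id": trace_id,     # decimal string
                        "ml_app": ml_app,
                        "metric_type": "score",
                        "label": label,
                        "score_value": score,
                        "timestamp_ms": timestamp_ms,
                    }
                ]
            },
        }
    }


payload = build_evaluation_payload(
    span_id="20245611112024561111",
    trace_id="13932955089405749200",
    ml_app="weather-bot",
    label="Accuracy",
    score=3,
    timestamp_ms=1765990800016,
)
print(json.dumps(payload, indent=2))

# To submit (assumed endpoint path and headers; verify against the Evaluations API docs):
# import os, requests
# requests.post(
#     "https://api.datadoghq.com/api/intake/llm-obs/v2/eval-metric",
#     headers={"DD-API-KEY": os.environ["DD_API_KEY"]},
#     json=payload,
# )
```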

## Further Reading{% #further-reading %}

- [Learn about building custom evaluators](https://docs.datadoghq.com/llm_observability/guide/evaluation_developer_guide)
- [Learn about the LLM Observability SDK for Python](https://docs.datadoghq.com/llm_observability/setup/sdk)
- [Learn about the Evaluations API](https://docs.datadoghq.com/llm_observability/setup/api)
- [Learn about submitting evaluations from NVIDIA NeMo](https://docs.datadoghq.com/llm_observability/evaluations/submit_nemo_evaluations)
