Overview
This guide covers how to build custom evaluators with the LLM Observability SDK and use them in LLM Experiments and in production.
Key concepts
An evaluation measures a specific quality of your LLM application’s output, such as accuracy, tone, or harmfulness. You write the evaluation logic inside an evaluator, which receives context about the LLM interaction and returns a result.
Running evaluators in an Experiment
To test your LLM application against a dataset before deploying, run your evaluators in LLM Experiments. In Experiments, evaluators run automatically: the SDK calls your evaluator on each dataset record. Use evaluators through the SDK.
Running evaluators in production
To monitor the quality of your live LLM responses, run evaluators in production. You can run evaluators manually with submit_evaluation(), or automatically with custom LLM-as-a-judge evaluations. Use evaluators through the SDK, HTTP API, or the Datadog UI.
For production, there are two approaches:
Manual evaluations (this guide): You run evaluators in your application code and submit results with LLMObs.submit_evaluation() or the HTTP API. This gives you full control over evaluation logic and timing.
Custom LLM-as-a-judge evaluations: You configure evaluations in the Datadog UI using natural language prompts. Datadog automatically runs them on production traces in real time, with no code changes required.
EvaluatorContext: The input to an evaluator. Contains the LLM’s input, output, expected output, and span identifiers. In Experiments, the SDK builds this automatically from each dataset record. In production, you construct the EvaluatorContext yourself.
EvaluatorResult: The output of an evaluator. Contains a typed value, optional reasoning, a pass/fail assessment, metadata, and tags. You can also return a plain value (str, float, int, bool, dict) instead.
Metric type: Determines how the evaluation value is interpreted and displayed: categorical (string labels), score (numeric), boolean (pass/fail), or json (structured data).
SummaryEvaluatorContext: The input to a summary evaluator (Experiments only). After all dataset records are evaluated, summary evaluators receive the aggregated results to compute statistics such as averages or pass rates.
The typical flow:
Experiments: Dataset record → EvaluatorContext → Evaluator → EvaluatorResult → (after all records) SummaryEvaluatorContext → Summary evaluator → summary result
Production: Span data → EvaluatorContext (built manually) → Evaluator → EvaluatorResult → LLMObs.submit_evaluation() or HTTP API
Building evaluators
There are two ways to define an evaluator: class-based and function-based.
|  | Class-based | Function-based |
| --- | --- | --- |
| Best for | Reusable evaluators with custom configuration or state. | One-off evaluators with straightforward logic. |
| Receives | An EvaluatorContext object with full span context (input, output, expected output, metadata, span/trace IDs). | input_data, output_data, and expected_output as separate arguments. |
| Supports summary evaluators | Yes (BaseSummaryEvaluator). | No. |
If you are unsure, start with class-based evaluators. They provide the same capabilities as function-based evaluators, plus access to the full EvaluatorContext and support for summary evaluators.
Class-based evaluators
Class-based evaluators provide a structured way to implement reusable evaluation logic with custom configuration.
BaseEvaluator
Subclass BaseEvaluator to create an evaluator that runs on a single span or dataset record. Implement the evaluate method, which receives an EvaluatorContext and returns an EvaluatorResult (or a plain value).
```python
from ddtrace.llmobs import BaseEvaluator, EvaluatorContext, EvaluatorResult

class SemanticSimilarityEvaluator(BaseEvaluator):
    """Evaluates semantic similarity between output and expected output."""

    def __init__(self, threshold: float = 0.8):
        super().__init__(name="semantic_similarity")
        self.threshold = threshold

    def evaluate(self, context: EvaluatorContext) -> EvaluatorResult:
        score = compute_similarity(context.output_data, context.expected_output)
        return EvaluatorResult(
            value=score,
            reasoning=f"Similarity score: {score:.2f}",
            assessment="pass" if score >= self.threshold else "fail",
            metadata={"threshold": self.threshold},
            tags={"type": "semantic"},
        )
```
Call super().__init__(name="evaluator_name") to set the evaluator’s label.
Implement evaluate(context: EvaluatorContext) with your evaluation logic.
Return an EvaluatorResult for rich results, or a plain value (str, float, int, bool, dict).
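For example, here is a minimal sketch of a class-based evaluator that returns a plain value rather than an EvaluatorResult (the ResponseLengthEvaluator name is illustrative, not part of the SDK):

```python
from ddtrace.llmobs import BaseEvaluator, EvaluatorContext

class ResponseLengthEvaluator(BaseEvaluator):
    """Returns the output length as a plain int instead of an EvaluatorResult."""

    def __init__(self):
        super().__init__(name="response_length")

    def evaluate(self, context: EvaluatorContext) -> int:
        # A plain value is recorded directly as the evaluation value.
        return len(str(context.output_data))
```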
BaseSummaryEvaluator
Summary evaluators are only available in experiments.
Subclass BaseSummaryEvaluator to create an evaluator that operates on the aggregated results of an entire experiment run. It receives a SummaryEvaluatorContext containing all inputs, outputs, and per-evaluator results.
```python
from ddtrace.llmobs import BaseSummaryEvaluator, SummaryEvaluatorContext

class AverageScoreEvaluator(BaseSummaryEvaluator):
    """Computes average score across all evaluation results."""

    def __init__(self, target_evaluator: str):
        super().__init__(name="average_score")
        self.target_evaluator = target_evaluator

    def evaluate(self, context: SummaryEvaluatorContext):
        scores = context.evaluation_results.get(self.target_evaluator, [])
        if not scores:
            return None
        return sum(scores) / len(scores)
```
Call super().__init__(name="evaluator_name") to set the evaluator’s label.
Access per-evaluator results through context.evaluation_results, which maps evaluator names to lists of results.
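As another illustration, here is a sketch of a summary evaluator that reports a pass rate. Like the averaging example above, it assumes context.evaluation_results holds the plain values the target evaluator returned (for example, booleans from an exact-match evaluator):

```python
from ddtrace.llmobs import BaseSummaryEvaluator, SummaryEvaluatorContext

class PassRateEvaluator(BaseSummaryEvaluator):
    """Computes the fraction of truthy results for a target evaluator."""

    def __init__(self, target_evaluator: str):
        super().__init__(name="pass_rate")
        self.target_evaluator = target_evaluator

    def evaluate(self, context: SummaryEvaluatorContext):
        # Assumption: the list contains the plain values returned by the target evaluator.
        results = context.evaluation_results.get(self.target_evaluator, [])
        if not results:
            return None
        return sum(1 for r in results if r) / len(results)
```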
Function-based evaluators
For straightforward evaluation logic, define a function instead of a class. Function-based evaluators receive the input, output, and expected output directly as arguments.
```python
from ddtrace.llmobs import EvaluatorResult

def exact_match_evaluator(input_data, output_data, expected_output):
    """Checks if output exactly matches expected output."""
    matches = output_data == expected_output
    return EvaluatorResult(
        value=matches,
        reasoning="Exact match" if matches else "Output differs from expected",
        assessment="pass" if matches else "fail",
    )
```
Like class-based evaluators, function-based evaluators can return an EvaluatorResult for rich results with reasoning and metadata, or a plain value (str, float, int, bool, dict).
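For example, a minimal sketch of a function-based evaluator that returns a plain float (the length_ratio_evaluator name is illustrative):

```python
def length_ratio_evaluator(input_data, output_data, expected_output):
    """Returns the ratio of output length to expected output length as a plain float."""
    if not expected_output:
        return 0.0
    return len(str(output_data)) / len(str(expected_output))
```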
Using evaluators in experiments
Pass your evaluators to LLMObs.experiment() to run them against every record in a dataset. The SDK automatically builds an EvaluatorContext for each record and calls your evaluator. After all records are processed, any summary evaluators run on the aggregated results.
```python
from ddtrace.llmobs import LLMObs, Dataset, DatasetRecord

# Create dataset
dataset = Dataset(
    name="qa_dataset",
    records=[
        DatasetRecord(input_data={"question": "What is 2+2?"}, expected_output="4"),
        DatasetRecord(input_data={"question": "What is the capital of France?"}, expected_output="Paris"),
    ],
)

# Define task
def qa_task(input_data, config):
    return generate_answer(input_data["question"])

# Create evaluators
semantic_eval = SemanticSimilarityEvaluator(threshold=0.7)
summary_eval = AverageScoreEvaluator("semantic_similarity")

# Run experiment
experiment = LLMObs.experiment(
    name="qa_experiment",
    task=qa_task,
    dataset=dataset,
    evaluators=[semantic_eval, exact_match_evaluator],
    summary_evaluators=[summary_eval],
)
experiment.run()
```
Using evaluators in production
To submit evaluations from your application code, construct the EvaluatorContext yourself, call the evaluator, and submit the result with LLMObs.submit_evaluation(). You can also submit evaluations through the HTTP API.
```python
from ddtrace.llmobs import LLMObs, EvaluatorContext
from ddtrace.llmobs.decorators import llm

evaluator = SemanticSimilarityEvaluator(threshold=0.8)

@llm(model_name="claude", name="invoke_llm", model_provider="anthropic")
def llm_call(input_text):
    completion = ...  # Your LLM application logic

    # Build the evaluation context from the span data
    context = EvaluatorContext(
        input_data=input_text,
        output_data=completion,
        expected_output=None,
    )

    # Run the evaluator
    result = evaluator.evaluate(context)

    # Submit the result to Datadog
    LLMObs.submit_evaluation(
        span=LLMObs.export_span(),
        ml_app="chatbot",
        label=evaluator.name,
        metric_type="score",
        value=result.value,
        assessment=result.assessment,
        reasoning=result.reasoning,
    )
    return completion
```
Data model reference
EvaluatorContext
A frozen dataclass containing all the information needed to run an evaluation.
| Field | Type | Description |
| --- | --- | --- |
| input_data | Any | The input provided to the LLM application (for example, a prompt). |
| output_data | Any | The actual output from the LLM application. |
| expected_output | Any | The expected or ideal output the LLM should have produced. |
| metadata | Dict[str, Any] | Additional metadata. |
| span_id | str | The span’s unique identifier. |
| trace_id | str | The trace’s unique identifier. |
In Experiments, the SDK populates this automatically from each dataset record. In production, you construct it yourself from your span data.
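For example, here is a sketch of constructing the context in production. It assumes, as in the example earlier in this guide, that LLMObs.export_span() returns the identifiers for the currently active LLM Observability span:

```python
from ddtrace.llmobs import LLMObs, EvaluatorContext

# Assumption: export_span() returns a mapping with "span_id" and "trace_id" keys
# for the currently active LLM Observability span.
span_ref = LLMObs.export_span()

context = EvaluatorContext(
    input_data="What is 2+2?",
    output_data="4",
    expected_output=None,
    metadata={"environment": "production"},
    span_id=span_ref["span_id"],
    trace_id=span_ref["trace_id"],
)
```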
EvaluatorResult
Allows you to return rich evaluation results with additional context. Used in both Experiments and production.
| Field | Type | Description |
| --- | --- | --- |
| value | Union[str, float, int, bool, dict] | The evaluation value. Type depends on metric_type. |
| reasoning | Optional[str] | A text explanation of the evaluation result. |
| assessment | Optional[str] | An assessment of this evaluation. Accepted values are pass and fail. |
| metadata | Optional[Dict[str, Any]] | Additional metadata about the evaluation. |
| tags | Optional[Dict[str, str]] | Tags to apply to the evaluation metric. |
SummaryEvaluatorContext
A frozen dataclass providing aggregated evaluation results across all dataset records in an experiment. Only used by summary evaluators.
| Field | Type | Description |
| --- | --- | --- |
| inputs | List[Any] | List of all input data from the experiment. |
| outputs | List[Any] | List of all output data from the experiment. |
| expected_outputs | List[Any] | List of all expected outputs from the experiment. |
| evaluation_results | Dict[str, List[Any]] | Dictionary mapping evaluator names to their results. |
| metadata | Dict[str, Any] | Additional metadata associated with the experiment. |
Metric types
The metric type is set when submitting an evaluation (through submit_evaluation() or the HTTP API) and determines how the value is validated and displayed in Datadog.
| Metric type | Value type | Use case |
| --- | --- | --- |
| categorical | str | Classifying outputs into categories (for example, “Positive”, “Negative”, “Neutral”) |
| score | float or int | Numeric scores or ratings (for example, 0.0-1.0, 1-10) |
| boolean | bool | Pass/fail or yes/no evaluations |
| json | dict | Structured evaluation data (for example, multi-dimensional rubrics or detailed breakdowns) |
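As an illustration, here is a sketch of submitting a categorical and a boolean evaluation using the same submit_evaluation() pattern shown earlier in this guide (the label names are illustrative):

```python
from ddtrace.llmobs import LLMObs

span_ref = LLMObs.export_span()

# Categorical: a string label
LLMObs.submit_evaluation(
    span=span_ref,
    ml_app="chatbot",
    label="sentiment",
    metric_type="categorical",
    value="Positive",
)

# Boolean: a true/false value
LLMObs.submit_evaluation(
    span=span_ref,
    ml_app="chatbot",
    label="contains_citation",
    metric_type="boolean",
    value=True,
)
```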
Best practices
Naming conventions
Evaluation labels must follow these conventions:
Must start with a letter
Must only contain ASCII alphanumerics or underscores
Other characters, including spaces, are converted to underscores
Unicode is not supported
Must not exceed 200 characters (fewer than 100 is preferred)
Must be unique for a given LLM application (ml_app) and organization
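If you build labels dynamically, a hypothetical helper like the following can normalize them to satisfy these conventions (the normalize_label function is illustrative, not part of the SDK):

```python
import re

def normalize_label(label: str) -> str:
    """Normalize an evaluation label: ASCII alphanumerics and underscores only,
    starting with a letter, at most 200 characters."""
    # Replace anything that is not an ASCII letter, digit, or underscore.
    label = re.sub(r"[^A-Za-z0-9_]", "_", label)
    # Ensure the label starts with a letter.
    if not label or not label[0].isalpha():
        label = "eval_" + label
    return label[:200]
```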
Concurrent execution
Set the jobs parameter to run tasks and evaluators concurrently on multiple threads, allowing experiments to complete faster when processing multiple dataset records.
Asynchronous evaluators are not yet supported for concurrent execution. Only synchronous evaluators benefit from parallel execution.
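For example, assuming your SDK version accepts a jobs argument on experiment.run() (check the SDK reference for your version):

```python
# Run the task and evaluators on up to 4 worker threads.
# Assumption: this SDK version accepts a `jobs` argument on run().
experiment.run(jobs=4)
```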