
Playground


The Playground is a browser-based interface for testing and evaluating LLM prompts. You can test a prompt with arbitrary input against any connected model provider, or load a dataset, attach evaluators, preview results, and save the configuration as a reproducible experiment—all without writing code.

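The Playground supports two flows:

  • Prompts mode: test a prompt with arbitrary input and iterate on it with a fixed set of inputs.
  • Experiments mode: run a prompt across a dataset, score the results with evaluators, and save the configuration as a reproducible experiment.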

Prerequisites

Connect at least one model provider integration in AI Observability > Settings > Integrations before using the Playground.

Supported providers:

  • OpenAI
  • Anthropic
  • Azure OpenAI
  • Amazon Bedrock
  • Vertex AI
  • AI Gateway

Test a prompt with arbitrary input

LLM Observability Playground showing a user message with {{category}} and {{approach}} variable placeholders and the model response in the output panel.

Use Prompts mode to iterate on a prompt with a fixed set of inputs.

  1. Navigate to AI Observability > Playground.
  2. Write your system and user prompts in the message editor. To parameterize inputs, use {{variable_name}} in any message.
  3. Open the Model configuration panel using the top bar.
  4. In the configuration panel, select a Provider, an Account, and a Model.
  5. (Optional) Click Edit Response Structure to request structured output from the model.
  6. (Optional) Click Model Parameters to specify the parameters of the model.
  7. (Optional) Click Add New next to Tools to add tool definitions in JSON function schema format. Use the provided examples (Weather, Web Search, Email, Stock Price) as starting points.
  8. Enter a value for each variable in the Variables section. Each value replaces the matching {{variable_name}} placeholder in your system and user prompts.
  9. Click Done to save the configuration and close the modal.
  10. Click Run to send the prompt and view the model response.
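
Step 7 asks for tool definitions in JSON function schema format. As a rough illustration, the following sketch builds a hypothetical weather tool in the function-calling layout commonly used by OpenAI-style APIs and prints it as JSON. The tool name, description, and parameters are assumptions chosen for illustration, and the built-in Weather example in the Playground may differ.

```python
import json

# Hypothetical tool definition. The field layout follows the common
# OpenAI-style function-calling schema, which is an assumption here.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name."},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# Serialize to JSON text that could be pasted into a tool editor.
tool_json = json.dumps(weather_tool, indent=2)
print(tool_json)
```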

Edit messages, variable values, or model settings, then click Run again to iterate.
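
To make the {{variable_name}} substitution concrete, here is a minimal sketch of how placeholder values could be filled into a message. The `render` function and its regular expression are illustrative assumptions, not the Playground's implementation.

```python
import re

# Matches {{name}} or {{dotted.name}} placeholders.
PLACEHOLDER = re.compile(r"\{\{([\w.]+)\}\}")

def render(template: str, variables: dict) -> str:
    """Replace each {{name}} with its value; leave unknown names untouched."""
    def substitute(match):
        name = match.group(1)
        # Fall back to the literal placeholder when no value is defined.
        return str(variables.get(name, match.group(0)))
    return PLACEHOLDER.sub(substitute, template)

prompt = "Suggest a {{category}} recipe using a {{approach}} approach."
rendered = render(prompt, {"category": "dessert", "approach": "no-bake"})
print(rendered)  # Suggest a dessert recipe using a no-bake approach.
```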

Run an experiment from the Playground

LLM Observability Playground in experiment mode with all steps completed and Save and Run as experiment active. The Experiment Preview table shows OUTPUT, EVAL_EXPECTED_OUTPUT, INPUT, and EXPECTEDOUTPUT columns with PASS and FAIL badges per row and a summary banner reading 6 of 20 records passed.

Use Experiments mode to test your prompt across a dataset, score results with evaluators, and save the configuration as a reproducible experiment.

Configure your prompt

Write your prompt using {{variable_name}} placeholders where dataset values will be substituted. Use dot notation to reference nested dataset fields—for example, {{input.question}} to reference a field named question in the input section of a record.
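
The following sketch illustrates what dot notation means for a nested record: each segment of the path selects one level of the structure. The `resolve_path` helper and the sample record are hypothetical and only show the lookup logic.

```python
from typing import Any

def resolve_path(record: dict, path: str) -> Any:
    """Follow a dotted path such as 'input.question' through nested dicts."""
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            # The path does not exist in this record.
            raise KeyError(path)
        current = current[key]
    return current

# Hypothetical dataset record.
record = {
    "input": {"question": "What is the capital of France?", "category": "geography"},
    "expected_output": "Paris",
}
question = resolve_path(record, "input.question")
print(question)  # What is the capital of France?
```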

1. Add a dataset

Select a dataset from your project. The dataset provides the records the Playground runs your prompt against.

To create a dataset, see Datasets.

2. Add variables

Map dataset columns to the sections available in your prompt:

| Section | Description |
| --- | --- |
| Input | Columns from the input section of each record, used to fill {{input.*}} variables. If the input value is a JSON object, its top-level keys are exposed as individual variables (for example, {{input.question}} and {{input.category}}). If the input value is a plain string or number, the whole field is available as {{input}}. |
| Expected Output | Ground truth values used by evaluators to score model output. If the value is a JSON object, its top-level keys are exposed individually (for example, {{expected_output.answer}}). If the value is a plain string or number, the whole field is available as {{expected_output}}. |
| Metadata | Additional context columns. Top-level keys of a JSON object are available as {{metadata.*}} variables. |
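
As an illustration of the mapping described above, this sketch lists the variable names a single record would expose. The `exposed_variables` helper is hypothetical. Treating a plain metadata value as a single `metadata` variable is an assumption, because the source only describes JSON objects for that section.

```python
from typing import Any

SECTIONS = ("input", "expected_output", "metadata")

def exposed_variables(record: dict) -> list:
    """List the variable names a record makes available to a prompt."""
    names = []
    for section in SECTIONS:
        if section not in record:
            continue
        value: Any = record[section]
        if isinstance(value, dict):
            # JSON object: each top-level key becomes its own variable.
            names.extend(f"{section}.{key}" for key in value)
        else:
            # Plain string or number: the whole field is one variable
            # (an assumption for the metadata section).
            names.append(section)
    return names

record = {
    "input": {"question": "What is the capital of France?", "category": "geography"},
    "expected_output": "Paris",
    "metadata": {"source": "quiz"},
}
variables = exposed_variables(record)
print(variables)
# ['input.question', 'input.category', 'expected_output', 'metadata.source']
```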

Click Use this dataset to proceed to the preview stage.

If a variable references a path that does not exist in the dataset—for example, {{question}} instead of {{input.question}}—the model receives the literal template string. Go back to your prompt and correct the variable paths to match the columns shown in the table above.
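
A quick way to catch this kind of mistake before running a preview is to compare the placeholders in the prompt with the variable names the dataset provides. The sketch below is a hypothetical check, not a Playground feature. The `available` set stands in for the variables shown in the mapping step.

```python
import re

PLACEHOLDER = re.compile(r"\{\{([\w.]+)\}\}")

def unresolved_placeholders(prompt: str, available: set) -> list:
    """Return placeholder names in the prompt that the dataset does not provide."""
    found = PLACEHOLDER.findall(prompt)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return [name for name in dict.fromkeys(found) if name not in available]

available = {"input.question", "input.category", "expected_output"}
prompt = "Category: {{input.category}}. Answer this: {{question}}"
missing = unresolved_placeholders(prompt, available)
print(missing)  # ['question']
```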

3. Add evaluators

Evaluators score each row after a preview run. Click Add Evaluators in the toolbar to open the evaluator configuration modal.

The Playground supports String Check evaluators. Add multiple evaluators to score different aspects of the output in one run.

| Field | Description |
| --- | --- |
| Operator | The comparison to apply: equals, not equals, or contains. |
| Case sensitive | When enabled, the comparison is case-sensitive. |
| Strip whitespace | When enabled, leading and trailing whitespace is trimmed before comparing. |
| Left operand | The value to evaluate. Defaults to the model output (output). |
| Right operand | The value to compare against. Defaults to the expected output. Supports dot notation for nested fields. |
| Name | An alias displayed as the column header in the results table. |
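
The fields above can be read as the parameters of a simple string comparison. The following sketch models that behavior under stated assumptions: the `StringCheck` class, its default values, and the reading of contains as "left operand contains right operand" are illustrative, not the Playground's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class StringCheck:
    name: str
    operator: str = "equals"        # "equals", "not equals", or "contains"
    case_sensitive: bool = True     # default is an assumption
    strip_whitespace: bool = False  # default is an assumption

    def evaluate(self, left: str, right: str) -> bool:
        """Compare the left operand (model output) with the right operand."""
        if self.strip_whitespace:
            left, right = left.strip(), right.strip()
        if not self.case_sensitive:
            left, right = left.casefold(), right.casefold()
        if self.operator == "equals":
            return left == right
        if self.operator == "not equals":
            return left != right
        if self.operator == "contains":
            return right in left
        raise ValueError(f"unknown operator: {self.operator}")

check = StringCheck(name="exact_match", case_sensitive=False, strip_whitespace=True)
result = check.evaluate(" paris\n", "Paris")
print("PASS" if result else "FAIL")  # PASS
```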

4. Run a preview

Click Run Preview to execute the prompt on up to 20 dataset records.

After the preview completes:

  • Each row shows a PASS or FAIL badge for each evaluator.
  • The column header shows the aggregate pass and fail counts.
  • A summary banner displays the overall result.
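
The sketch below shows one way such counts could be derived from per-row evaluator results. It assumes that a record counts as passed only when every evaluator passes for that row. The source does not state how the summary banner is computed, so treat this rule and the `summarize` helper as hypothetical.

```python
from collections import Counter

def summarize(rows: list) -> tuple:
    """Return per-evaluator PASS/FAIL counts and the number of fully passing rows."""
    per_evaluator = {}
    records_passed = 0
    for row in rows:
        for name, passed in row.items():
            label = "PASS" if passed else "FAIL"
            per_evaluator.setdefault(name, Counter())[label] += 1
        # Assumption: a record passes only if all of its evaluators pass.
        if all(row.values()):
            records_passed += 1
    return per_evaluator, records_passed

# Hypothetical results: one dict per row, mapping evaluator name to outcome.
rows = [
    {"exact_match": True, "mentions_city": True},
    {"exact_match": False, "mentions_city": True},
    {"exact_match": False, "mentions_city": False},
]
counts, passed = summarize(rows)
print(counts["exact_match"]["PASS"], counts["exact_match"]["FAIL"])  # 1 2
print(f"{passed} of {len(rows)} records passed")  # 1 of 3 records passed
```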

Click a FAIL badge to expand a popover showing the actual output, the operator, the expected value, and a contextual hint. For example, when an equals check fails because the output contains the expected value as a substring, the popover suggests switching to contains.
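
The logic behind such a hint can be sketched as follows. Only the contains suggestion is described in this page. The case-related hint and the `failure_hint` helper itself are assumptions added for illustration.

```python
from typing import Optional

def failure_hint(operator: str, output: str, expected: str) -> Optional[str]:
    """Suggest an alternative check for a failed equals comparison, if one would pass."""
    if operator != "equals" or output == expected:
        return None
    if output.casefold() == expected.casefold():
        # Hypothetical hint, not confirmed by the source.
        return "Output matches except for letter case; consider turning off Case sensitive."
    if expected and expected in output:
        return "Output contains the expected value; consider switching the operator to contains."
    return None

hint = failure_hint("equals", "The capital is Paris.", "Paris")
print(hint)
```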

Iterate on prompt and evaluator configuration

LLM Observability Playground showing stale preview state after a prompt edit. A warning banner reads 'Prompt or settings changed since the last preview. Run the preview again before running the full dataset.' with a Re-run Preview button. The results table shows PASS and FAIL badges from the previous run.

After reviewing results, edit the prompt or evaluator configuration to improve scores. Any edit marks the preview results as stale. Click Re-run Preview to run again with the updated configuration.
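
One way to think about stale results is as a fingerprint comparison: if the configuration used for the last preview no longer matches the current one, the results are out of date. The sketch below illustrates the idea with a hash of the prompt, model settings, and evaluators. It is a conceptual example with hypothetical names and says nothing about how the Playground detects changes internally.

```python
import hashlib
import json

def config_fingerprint(prompt: str, model_settings: dict, evaluators: list) -> str:
    """Hash the full configuration so any edit produces a different value."""
    payload = json.dumps(
        {"prompt": prompt, "model": model_settings, "evaluators": evaluators},
        sort_keys=True,  # stable key order keeps the hash deterministic
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

settings = {"provider": "openai", "temperature": 0.2}
evaluators = [{"name": "exact_match", "operator": "equals"}]

last_preview = config_fingerprint("Answer: {{input.question}}", settings, evaluators)
current = config_fingerprint("Answer briefly: {{input.question}}", settings, evaluators)

is_stale = current != last_preview
print(is_stale)  # True
```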

Common iteration patterns:

  • If most rows fail equals, check whether contains or case-insensitive comparison better reflects the task.
  • If variable values appear as literals in the output (for example, {{input.question}}), correct the variable path in the prompt.
  • Adjust the prompt wording and re-run to observe the effect on pass rates.

5. Save the experiment

When the preview results meet your expectations, click Save & Run as experiment in the top toolbar to run on the full dataset.

In the dialog:

  1. Enter an Experiment name.
  2. Select a Project.
  3. Click Save.

The experiment runs across all records in the dataset—not only the 20-record preview sample. When complete, view results in AI Observability > Experiments.
