Quality evaluations help ensure your LLM-powered applications generate accurate, relevant, and safe responses. Managed evaluations automatically score model outputs on key quality dimensions and attach results to traces, helping you detect issues, monitor trends, and improve response quality over time.

Topic relevancy

This check identifies and flags user inputs that deviate from the configured acceptable input topics. This ensures that interactions stay pertinent to the LLM’s designated purpose and scope.

A topic relevancy evaluation detected by an LLM in LLM Observability

| Evaluation Stage | Evaluation Method | Evaluation Definition |
| --- | --- | --- |
| Evaluated on Input | Evaluated using LLM | Topic relevancy assesses whether each prompt-response pair remains aligned with the intended subject matter of the Large Language Model (LLM) application. For instance, an e-commerce chatbot receiving a question about a pizza recipe would be flagged as irrelevant. |

You can provide topics for this evaluation.

  1. Go to LLM Observability > Applications.
  2. Select the application you want to add topics for.
  3. In the top-right corner of the panel, select Settings.
  4. Beside Topic Relevancy, click Configure Evaluation.
  5. Click the Edit Evaluations icon for Topic Relevancy.
  6. Add topics on the configuration page.

Topics can contain multiple words and should be as specific and descriptive as possible. For example, for an LLM application that was designed for incident management, add “observability”, “software engineering”, or “incident resolution”. If your application handles customer inquiries for an e-commerce store, you can use “Customer questions about purchasing furniture on an e-commerce store”.

Hallucination

This check identifies instances where the LLM makes a claim that disagrees with the provided input context.

A Hallucination evaluation detected by an LLM in LLM Observability

| Evaluation Stage | Evaluation Method | Evaluation Definition |
| --- | --- | --- |
| Evaluated on Output | Evaluated using LLM | Hallucination flags any output that disagrees with the context provided to the LLM. |

Instrumentation

You can use Prompt Tracking annotations to track your prompts and make them available to hallucination detection. Annotate your LLM spans with the user query and context so hallucination detection can evaluate model outputs against the retrieved data.

from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import llm
from ddtrace.llmobs.types import Prompt

# If your LLM call is auto-instrumented, wrap it in an annotation context.
# `user_question` and `article` are assumed to be strings produced earlier by your application.
with LLMObs.annotation_context(
        prompt=Prompt(
            id="generate_answer_prompt",
            template="Generate an answer to this question :{user_question}. Only answer based on the information from this article : {article}",
            variables={"user_question": user_question, "article": article},
            rag_query_variables=["user_question"],
            rag_context_variables=["article"]
        ),
        name="generate_answer"
):
    oai_client.chat.completions.create(...) # autoinstrumented llm call

# If your LLM call is manually instrumented ...
@llm(name="generate_answer")
def generate_answer():
    ...
    LLMObs.annotate(
        prompt=Prompt(
            id="generate_answer_prompt",
            template="Generate an answer to this question :{user_question}. Only answer based on the information from this article : {article}",
            variables={"user_question": user_question, "article": article},
            rag_query_variables=["user_question"],
            rag_context_variables=["article"]
        ),
    )

The variables dictionary should contain the key-value pairs your app uses to construct the LLM input prompt (for example, the messages for an OpenAI chat completion request). Use rag_query_variables and rag_context_variables to specify which variables represent the user query and which represent the retrieval context. A list of variables is allowed to account for cases where multiple variables make up the context (for example, multiple articles retrieved from a knowledge base).
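
To illustrate the multiple-context-variable case, here is a minimal sketch. The prompt ID, variable names, and article strings are illustrative placeholders, and `oai_client` is assumed to be the same auto-instrumented OpenAI client used in the snippet above.

from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.types import Prompt

# Hypothetical example: the context is assembled from two retrieved articles,
# so both variables are listed in rag_context_variables.
user_question = "How do I return a damaged item?"
article_1 = "Returns are accepted within 30 days of delivery."   # placeholder retrieved document
article_2 = "Damaged items qualify for a full refund."           # placeholder retrieved document

with LLMObs.annotation_context(
        prompt=Prompt(
            id="answer_with_policy_docs",  # hypothetical prompt ID
            template="Answer this question: {user_question}. Only use these articles: {article_1} {article_2}",
            variables={
                "user_question": user_question,
                "article_1": article_1,
                "article_2": article_2,
            },
            rag_query_variables=["user_question"],            # the user's query
            rag_context_variables=["article_1", "article_2"]  # every variable that forms the context
        ),
        name="answer_with_policy_docs"
):
    oai_client.chat.completions.create(...)  # auto-instrumented LLM call, as above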

Hallucination detection does not run if the RAG query, the RAG context, or the span output is empty.

Prompt Tracking is available in the Python SDK starting with version 3.15. It also requires setting an ID and a template for the prompt so that you can monitor and track your prompt versions. You can find more examples of prompt tracking and instrumentation in the SDK documentation.

Hallucination configuration
Hallucination detection is only available for OpenAI.
Hallucination detection makes a distinction between two types of hallucinations, which can be configured when Hallucination is enabled.

| Configuration Option | Description |
| --- | --- |
| Contradiction | Claims made in the LLM-generated response that go directly against the provided context |
| Unsupported Claim | Claims made in the LLM-generated response that are not grounded in the context |

Contradictions are always detected, while Unsupported Claims can be optionally included. For example, if the provided context states that a store opens at 9 a.m., a response claiming it opens at 10 a.m. is a contradiction, while a response claiming the store offers free parking (something the context never mentions) is an unsupported claim. For sensitive use cases, we recommend including Unsupported Claims.

Failure to Answer

This check identifies instances where the LLM fails to deliver an appropriate response, which may occur due to limitations in the LLM’s knowledge or understanding, ambiguity in the user query, or the complexity of the topic.

A Failure to Answer evaluation detected by an LLM in LLM Observability

| Evaluation Stage | Evaluation Method | Evaluation Definition |
| --- | --- | --- |
| Evaluated on Output | Evaluated using LLM | Failure To Answer flags whether each prompt-response pair demonstrates that the LLM application has provided a relevant and satisfactory answer to the user's question. |

Failure to Answer Configuration
Configuring failure to answer evaluation categories is supported if OpenAI or Azure OpenAI is selected as your LLM provider.
You can configure the Failure to Answer evaluation to use specific categories of failure to answer, listed in the following table.

| Configuration Option | Description | Example(s) |
| --- | --- | --- |
| Empty Code Response | An empty code object, like an empty list or tuple, signifying no data or results | (), [], {}, “”, '' |
| Empty Response | No meaningful response, returning only whitespace | whitespace |
| No Content Response | An empty output accompanied by a message indicating no content is available | Not found, N/A |
| Redirection Response | Redirects the user to another source or suggests an alternative approach | If you have additional details, I'd be happy to include them |
| Refusal Response | Explicitly declines to provide an answer or to complete the request | Sorry, I can't answer this question |

Language Mismatch

This check identifies instances where the LLM generates responses in a different language or dialect than the one used by the user, which can lead to confusion or miscommunication. It helps ensure that the LLM's responses are clear, relevant, and appropriate for the user's linguistic preferences and needs.

Language mismatch is only supported for natural language prompts. Input and output pairs that mainly consist of structured data such as JSON, code snippets, or special characters are not flagged as a language mismatch.

Language Mismatch supports the following languages: Afrikaans, Albanian, Arabic, Armenian, Azerbaijani, Belarusian, Bengali, Norwegian Bokmål, Bosnian, Bulgarian, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Georgian, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Mongolian, Norwegian Nynorsk, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Serbian, Slovak, Slovene, Spanish, Swahili, Swedish, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Yoruba, Zulu.

A Language Mismatch evaluation detected by an open source model in LLM Observability

| Evaluation Stage | Evaluation Method | Evaluation Definition |
| --- | --- | --- |
| Evaluated on Input and Output | Evaluated using Open Source Model | Language Mismatch flags whether each prompt-response pair demonstrates that the LLM application answered the user's question in the same language that the user used. |

Sentiment

This check helps you understand the overall mood of the conversation, gauge user satisfaction, identify sentiment trends, and interpret emotional responses. It classifies the sentiment of the text, providing insights to improve user experiences and tailor responses to better meet user needs.

A Sentiment evaluation detected by an LLM in LLM Observability

| Evaluation Stage | Evaluation Method | Evaluation Definition |
| --- | --- | --- |
| Evaluated on Input and Output | Evaluated using LLM | Sentiment flags the emotional tone or attitude expressed in the text, categorizing it as positive, negative, or neutral. |