Out-of-the-box evaluations are built-in tools to assess your LLM application on dimensions like quality, security, and safety. By enabling them, you can assess the effectiveness of your application's responses, including detection of negative sentiment, topic relevancy, toxicity, failure to answer, and hallucination.
LLM Observability associates evaluations with individual spans so you can view the inputs and outputs that led to a specific evaluation.
LLM Observability out-of-the-box evaluations leverage LLMs. To connect your LLM provider to Datadog, you need a key from the provider.
Connect your OpenAI account to LLM Observability with your OpenAI API key. LLM Observability uses the GPT-4o mini model for evaluations.
Connect your Azure OpenAI account to LLM Observability with your Azure OpenAI API key. We strongly recommend using the GPT-4o mini model for evaluations.
Connect your Anthropic account to LLM Observability with your Anthropic API key. LLM Observability uses the Haiku model for evaluations.
Connect your Amazon Bedrock account to LLM Observability with your AWS account. LLM Observability uses the Haiku model for evaluations.
After you click Save, LLM Observability uses the LLM account you connected to power the evaluation you enabled.
For more information about evaluations, see Terms and Concepts.
LLM Observability provides metrics to help you monitor and manage the token usage associated with evaluations that power LLM Observability. The following metrics allow you to track the LLM resources consumed to power evaluations:
- ml_obs.estimated_usage.llm.input.tokens
- ml_obs.estimated_usage.llm.output.tokens
- ml_obs.estimated_usage.llm.total.tokens
Each of these metrics has ml_app, model_server, model_provider, model_name, and evaluation_name tags, allowing you to pinpoint specific applications, models, and evaluations contributing to your usage.
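To attribute usage, you can group these metrics by their tags. As a rough sketch (not part of the evaluation setup itself), the following example queries the last hour of total evaluation token usage through the datadog-api-client Python package, assuming DD_API_KEY and DD_APP_KEY are set in your environment:

```python
from datetime import datetime, timedelta

from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.metrics_api import MetricsApi

# Configuration() picks up DD_API_KEY / DD_APP_KEY from the environment.
configuration = Configuration()

with ApiClient(configuration) as api_client:
    api = MetricsApi(api_client)
    now = datetime.now()
    # Query the last hour of evaluation token usage, grouped by
    # application and evaluation name.
    result = api.query_metrics(
        _from=int((now - timedelta(hours=1)).timestamp()),
        to=int(now.timestamp()),
        query="sum:ml_obs.estimated_usage.llm.total.tokens{*} by {ml_app,evaluation_name}",
    )
    print(result)
```

The same query string can also be used directly in a dashboard widget or monitor.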
The topic relevancy check identifies and flags user inputs that deviate from the configured acceptable input topics. This ensures that interactions stay pertinent to the LLM's designated purpose and scope.
Evaluation Stage | Evaluation Method | Evaluation Definition |
---|---|---|
Evaluated on Input | Evaluated using LLM | Topic relevancy assesses whether each prompt-response pair remains aligned with the intended subject matter of the Large Language Model (LLM) application. For instance, an e-commerce chatbot receiving a question about a pizza recipe would be flagged as irrelevant. |
You can provide topics for this evaluation.
Topics can contain multiple words and should be as specific and descriptive as possible. For example, for an LLM application that was designed for incident management, add “observability”, “software engineering”, or “incident resolution”. If your application handles customer inquiries for an e-commerce store, you can use “Customer questions about purchasing furniture on an e-commerce store”.
The hallucination check identifies instances where the LLM makes a claim that disagrees with the provided input context.
Evaluation Stage | Evaluation Method | Evaluation Definition |
---|---|---|
Evaluated on Output | Evaluated using LLM | Hallucination flags any output that disagrees with the context provided to the LLM. |
To take advantage of Hallucination detection, annotate LLM spans with the user query and context:
```python
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import llm
from ddtrace.llmobs.utils import Prompt

# If your LLM call is auto-instrumented...
with LLMObs.annotation_context(
    prompt=Prompt(
        variables={"user_question": user_question, "article": article},
        rag_query_variables=["user_question"],
        rag_context_variables=["article"],
    ),
    name="generate_answer",
):
    oai_client.chat.completions.create(...)  # auto-instrumented LLM call

# If your LLM call is manually instrumented...
@llm(name="generate_answer")
def generate_answer():
    ...
    LLMObs.annotate(
        prompt=Prompt(
            variables={"user_question": user_question, "article": article},
            rag_query_variables=["user_question"],
            rag_context_variables=["article"],
        ),
    )
```
The variables dictionary should contain the key-value pairs your app uses to construct the LLM input prompt (for example, the messages for an OpenAI chat completion request). Set rag_query_variables and rag_context_variables to indicate which variables constitute the query and the context, respectively. A list of variables is allowed to account for cases where multiple variables make up the context (for example, multiple articles retrieved from a knowledge base).
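For a concrete picture of how the annotated variables relate to the prompt your application actually sends, here is a hypothetical end-to-end sketch with an auto-instrumented OpenAI client; the model name, system prompt, and answer_with_context helper are illustrative placeholders, not part of the SDK:

```python
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.utils import Prompt
from openai import OpenAI

oai_client = OpenAI()  # auto-instrumented when ddtrace is enabled

def answer_with_context(user_question: str, article: str) -> str:
    # The same strings passed as Prompt variables are used to build the
    # chat messages, so the hallucination check can compare the response
    # against the retrieved context (article) and the query (user_question).
    with LLMObs.annotation_context(
        prompt=Prompt(
            variables={"user_question": user_question, "article": article},
            rag_query_variables=["user_question"],
            rag_context_variables=["article"],
        ),
        name="generate_answer",
    ):
        response = oai_client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[
                {"role": "system", "content": f"Answer using only this article:\n{article}"},
                {"role": "user", "content": user_question},
            ],
        )
        return response.choices[0].message.content
```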
Hallucination detection makes a distinction between two types of hallucinations, which can be configured when Hallucination is enabled.
Configuration Option | Description |
---|---|
Contradiction | Claims made in the LLM-generated response that go directly against the provided context |
Unsupported Claim | Claims made in the LLM-generated response that are not grounded in the context |
Contradictions are always detected, while Unsupported Claims can be optionally included. For example, if the provided context states that a store opens at 9 a.m., a response claiming it opens at 10 a.m. is a contradiction, whereas a response adding that the store offers free parking, which the context never mentions, is an unsupported claim. For sensitive use cases, we recommend including Unsupported Claims.
Hallucination detection is only available for OpenAI.
The failure to answer check identifies instances where the LLM fails to deliver an appropriate response, which may occur due to limitations in the LLM's knowledge or understanding, ambiguity in the user query, or the complexity of the topic.
Evaluation Stage | Evaluation Method | Evaluation Definition |
---|---|---|
Evaluated on Output | Evaluated using LLM | Failure To Answer flags whether each prompt-response pair demonstrates that the LLM application has provided a relevant and satisfactory answer to the user’s question. |
The types of Failure to Answer are defined below and can be configured when the Failure to Answer evaluation is enabled.
Configuration Option | Description | Example(s) |
---|---|---|
Empty Code Response | An empty code object, like an empty list or tuple, signifying no data or results | (), [], {}, “”, '' |
Empty Response | No meaningful response, returning only whitespace | whitespace |
No Content Response | An empty output accompanied by a message indicating no content is available | Not found, N/A |
Redirection Response | Redirects the user to another source or suggests an alternative approach | If you have additional details, I’d be happy to include them |
Refusal Response | Explicitly declines to provide an answer or to complete the request | Sorry, I can’t answer this question |
The language mismatch check identifies instances where the LLM generates responses in a different language or dialect than the one used by the user, which can lead to confusion or miscommunication. This check ensures that the LLM's responses are clear, relevant, and appropriate for the user's linguistic preferences and needs.
Language mismatch is only supported for natural language prompts. Input and output pairs that mainly consist of structured data such as JSON, code snippets, or special characters are not flagged as a language mismatch.
Evaluation Stage | Evaluation Method | Evaluation Definition |
---|---|---|
Evaluated on Input and Output | Evaluated using Open Source Model | Language Mismatch flags whether each prompt-response pair demonstrates that the LLM application answered the user’s question in the same language that the user used. |
The sentiment check helps you understand the overall mood of the conversation, gauge user satisfaction, identify sentiment trends, and interpret emotional responses. It classifies the sentiment of the text, providing insights to improve user experiences and tailor responses to better meet user needs.
Evaluation Stage | Evaluation Method | Evaluation Definition |
---|---|---|
Evaluated on Input and Output | Evaluated using LLM | Sentiment flags the emotional tone or attitude expressed in the text, categorizing it as positive, negative, or neutral. |
The toxicity check evaluates each input prompt from the user and each response from the LLM application for toxic content, identifying and flagging it to ensure that interactions remain respectful and safe.
Evaluation Stage | Evaluation Method | Evaluation Definition |
---|---|---|
Evaluated on Input and Output | Evaluated using LLM | Toxicity flags any language or behavior that is harmful, offensive, or inappropriate, including but not limited to hate speech, harassment, threats, and other forms of harmful communication. |
The prompt injection check identifies attempts by unauthorized or malicious authors to manipulate the LLM's responses or redirect the conversation in ways not intended by the original author. This check maintains the integrity and authenticity of interactions between users and the LLM.
Evaluation Stage | Evaluation Method | Evaluation Definition |
---|---|---|
Evaluated on Input | Evaluated using LLM | Prompt Injection flags any unauthorized or malicious insertion of prompts or cues into the conversation by an external party or user. |
The sensitive data scanning check ensures that sensitive information is handled appropriately and securely, reducing the risk of data breaches or unauthorized access.
Evaluation Stage | Evaluation Method | Evaluation Definition |
---|---|---|
Evaluated on Input and Output | Sensitive Data Scanner | Powered by the Sensitive Data Scanner, LLM Observability scans, identifies, and redacts sensitive information within every LLM application’s prompt-response pairs. This includes personal information, financial data, health records, or any other data that requires protection due to privacy or security concerns. |