Trace Metrics

Overview

Tracing application metrics are collected after you enable trace collection and instrument your application.

These metrics capture request counts, error counts, and latency measures. They are calculated based on 100% of the application’s traffic, regardless of any trace ingestion sampling configuration. Use these metrics to keep full visibility into your application’s traffic: spot potential errors on a service or a resource, and create dashboards, monitors, and SLOs.

Note: If your applications are instrumented with OpenTelemetry libraries, and sampling is set up at the SDK level, APM metrics are calculated based on the sampled set of data. However, if sampling is set up at the OpenTelemetry Collector level and the sampler processor is upstream of the Datadog connector, APM metrics are calculated based on 100% of application traffic.

Trace metrics are generated for service entry spans and certain operations depending on integration language. For example, the Django integration produces trace metrics from spans that represent various operations (1 root span for the Django request, 1 for each middleware, and 1 for the view).

The trace metrics namespace is formatted as:

trace.<SPAN_NAME>.<METRIC_SUFFIX>

With the following definitions:

<SPAN_NAME>: The name of the operation or span.name (examples: redis.command, pylons.request, rails.request, mysql.query).
<METRIC_SUFFIX>: The name of the metric (examples: hits, errors, apdex, duration). See the section below.
<TAGS>: Trace metrics tags. Possible tags are: env, service, version, resource, http.status_code, http.status_class, rpc.grpc.status_code (requires Datadog Agent v7.65.0+), and Datadog Agent tags (including the host and additional primary tags). Note: Other tags set on spans are not available as tags on trace metrics.
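As a minimal sketch of the naming scheme above, the following Python helper assembles a metric name from a span name and an optional suffix. The function name `trace_metric_name` is hypothetical and exists only for illustration; it is not part of any Datadog library.

```python
def trace_metric_name(span_name: str, suffix: str = "") -> str:
    """Build a name following trace.<SPAN_NAME>.<METRIC_SUFFIX>.

    Hypothetical helper for illustration only.
    """
    base = f"trace.{span_name}"
    # With no suffix, the result is the bare trace.<SPAN_NAME> form.
    return f"{base}.{suffix}" if suffix else base


for suffix in ("hits", "errors", "apdex", "duration"):
    print(trace_metric_name("rails.request", suffix))
```

For the span name `redis.command` and the suffix `hits`, the helper returns `trace.redis.command.hits`.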

Metric suffix

Hits

trace.<SPAN_NAME>.hits: Prerequisite: This metric exists for any APM service.
Description: Represents the count of spans created with a specific name (for example, redis.command, pylons.request, rails.request, or mysql.query).
Metric type: COUNT.
Tags: env, service, version, resource, resource_name, http.status_code, rpc.grpc.status_code, all host tags from the Datadog Host Agent, and additional primary tags.
trace.<SPAN_NAME>.hits.by_http_status: Prerequisite: This metric exists for HTTP/WEB APM services if http metadata exists.
Description: Represents the count of hits for a given span, broken down by HTTP status code.
Metric type: COUNT.
Tags: env, service, version, resource, resource_name, http.status_class, http.status_code, all host tags from the Datadog Host Agent, and additional primary tags.
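To illustrate what a per-status breakdown of hits looks like, here is a small Python sketch that groups a list of HTTP status codes by status class and status code. The `2xx`-style format for the status class and both function names are assumptions made for this example, not confirmed behavior of the Datadog Agent.

```python
from collections import Counter


def status_class(status_code: int) -> str:
    # Assumption: the class is written as "2xx", "4xx", and so on.
    return f"{status_code // 100}xx"


def hits_by_http_status(status_codes):
    """Count hits per (status class, status code) pair."""
    counts = Counter()
    for code in status_codes:
        counts[(status_class(code), code)] += 1
    return counts


breakdown = hits_by_http_status([200, 200, 404, 500])
for (klass, code), hits in sorted(breakdown.items()):
    print(klass, code, hits)
```

Summing the per-status counts gives back the total hit count for the span.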

Latency distribution

trace.<SPAN_NAME>: Prerequisite: This metric exists for any APM service.
Description: Represents the latency distribution for all services, resources, and versions across different environments and additional primary tags. Recommended for all latency measurement use cases.
Metric type: DISTRIBUTION.
Tags: env, service, version, resource, resource_name, http.status_code, rpc.grpc.status_code, synthetics, and additional primary tags.
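The following Python sketch shows why a distribution is useful for latency: percentiles expose slow outliers that an average hides. It uses a simple nearest-rank percentile over raw sample values, which is an assumption chosen for clarity and is not how Datadog computes DISTRIBUTION metrics. The sample latencies are made up.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (p between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up latencies in seconds: nine fast requests and one slow one.
latencies = [0.1] * 9 + [2.0]

print("avg:", sum(latencies) / len(latencies))
print("p50:", percentile(latencies, 50))
print("p95:", percentile(latencies, 95))
```

Here the median stays at the fast value while the 95th percentile lands on the slow request, a difference the average alone would blur.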

Errors

trace.<SPAN_NAME>.errors: Prerequisite: This metric exists for any APM service.
Description: Represents the count of errors for a given span.
Metric type: COUNT.
Tags: env, service, version, resource, resource_name, http.status_code, rpc.grpc.status_code, all host tags from the Datadog Host Agent, and additional primary tags.
trace.<SPAN_NAME>.errors.by_http_status: Prerequisite: This metric exists for any APM service.
Description: Represents the count of errors for a given span, broken down by HTTP status code.
Metric type: COUNT.
Tags: env, service, version, resource, http.status_class, http.status_code, all host tags from the Datadog Host Agent, and additional primary tags.

Apdex

trace.<SPAN_NAME>.apdex: Prerequisite: This metric exists for any HTTP or web-based APM service.
Description: Measures the Apdex score for each web service.
Metric type: GAUGE.
Tags: env, service, version, resource / resource_name, synthetics, and additional primary tags.
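For readers unfamiliar with Apdex, the sketch below computes the score using the standard Apdex definition: requests at or under a threshold T are satisfied, requests between T and 4T are tolerating and count for half, and everything slower is frustrated. The threshold value and the sample latencies are assumptions for illustration. How Datadog configures the threshold or treats errored requests is not covered by this page.

```python
def apdex(latencies_s, threshold_s):
    """Standard Apdex: (satisfied + tolerating / 2) / total.

    Returns None when there are no samples.
    """
    if not latencies_s:
        return None
    satisfied = sum(1 for x in latencies_s if x <= threshold_s)
    tolerating = sum(
        1 for x in latencies_s if threshold_s < x <= 4 * threshold_s
    )
    return (satisfied + tolerating / 2) / len(latencies_s)


# Two satisfied, one tolerating, one frustrated request with T = 0.5 s.
print(apdex([0.1, 0.2, 0.6, 3.0], threshold_s=0.5))  # 0.625
```

The score ranges from 0 (all requests frustrated) to 1 (all requests satisfied).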

Legacy metrics

The following metrics are maintained for backward compatibility. For all latency measurement use cases, Datadog strongly recommends using Latency Distribution metrics instead.

Duration (Legacy)

Important: Duration metrics are maintained for backward compatibility only. For all latency measurement use cases, Datadog strongly recommends using Latency Distribution metrics instead, as they provide better accuracy for percentile calculations and overall performance analysis.

trace.<SPAN_NAME>.duration: Prerequisite: This metric exists for any APM service.
Description: Measures the total time for a collection of spans within a time interval, including child spans seen in the collecting service. For most use cases, Datadog recommends using the Latency Distribution for calculation of average latency or percentiles. To calculate the average latency with host tag filters, you can use this metric with the following formula:
sum:trace.<SPAN_NAME>.duration{<FILTER>}.rollup(sum).fill(zero) / sum:trace.<SPAN_NAME>.hits{<FILTER>}.rollup(sum).fill(zero)
This metric does not support percentile aggregations. Read the Latency Distribution section for more information.
Metric type: GAUGE.
Tags: env, service, resource, http.status_code, all host tags from the Datadog Host Agent, and additional primary tags.
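The arithmetic behind the average latency formula above can be sketched in Python as follows. Each list holds one value per time bucket, with `None` standing in for a bucket that has no data. The function name and the sample values are hypothetical. Replacing `None` with zero mirrors the `.fill(zero)` step in the query.

```python
def average_latency_per_bucket(duration_sums, hit_counts):
    """Divide summed duration by summed hits, bucket by bucket."""
    averages = []
    for duration, hits in zip(duration_sums, hit_counts):
        duration = duration or 0.0  # missing bucket treated as zero
        hits = hits or 0            # missing bucket treated as zero
        # No hits in the bucket means there is no average to report.
        averages.append(duration / hits if hits else None)
    return averages


# Three buckets; the middle one has no data.
print(average_latency_per_bucket([1.5, None, 4.0], [3, None, 8]))
```

Because the result is a ratio of two sums, it yields only an average. It cannot recover percentiles, which is why the Latency Distribution metric is recommended for those.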

Duration by (Legacy)

trace.<SPAN_NAME>.duration.by_http_status: Prerequisite: This metric exists for HTTP/WEB APM services if http metadata exists.
Description: Measures the total time for a collection of spans for each HTTP status. Specifically, it is the relative share of time spent by all spans over an interval for a given HTTP status, including time spent waiting on child processes.
Metric type: GAUGE.
Tags: env, service, resource, http.status_class, http.status_code, all host tags from the Datadog Host Agent, and additional primary tags.

Sampling impact on trace metrics

In most cases, trace metrics are calculated based on all application traffic. However, with certain trace ingestion sampling configurations, the metrics represent only a subset of all requests.

Application-side sampling

Some tracing libraries support application-side sampling, which reduces the number of spans before they are sent to the Datadog Agent. For example, the Ruby tracing library offers application-side sampling to lower performance overhead. However, this can affect trace metrics, as the Datadog Agent needs all spans to calculate accurate metrics.

Very few tracing libraries support this setting, and using it is generally not recommended.

OpenTelemetry sampling

The OpenTelemetry SDK’s native sampling mechanisms lower the number of spans sent to the Datadog collector, resulting in sampled and potentially inaccurate trace metrics.
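To make the effect of SDK-level sampling concrete, the following Python sketch compares a hit count computed from all spans with one computed after sampling. The deterministic keep-one-in-N sampler is a simplifying assumption (real samplers are usually probabilistic), and the function names are hypothetical.

```python
def head_sample(spans, keep_every):
    """Keep one span out of every `keep_every` (simplified sampler)."""
    return spans[::keep_every]


def count_hits(spans):
    """Hit count as seen by whatever computes the metric."""
    return len(spans)


all_spans = list(range(1000))               # 1000 requests served
sampled_spans = head_sample(all_spans, 10)  # sampled in the SDK

print("hits from full traffic:", count_hits(all_spans))
print("hits after SDK sampling:", count_hits(sampled_spans))
```

In this sketch the metric computed after sampling reports one tenth of the real traffic, which is the kind of undercount described above.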