Ingestion Sampling with OpenTelemetry

Overview

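If your applications and services are instrumented with OpenTelemetry libraries, you can either:

  • Send traces to the OpenTelemetry Collector, and use the Datadog Exporter to forward them to Datadog, or
  • Ingest traces with the Datadog Agent, which then forwards them to Datadog.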

Note: Datadog doesn’t support running the OpenTelemetry Collector and the Datadog Agent on the same host.

In the first scenario, APM Trace metrics are computed in the Datadog Connector (not the Datadog Exporter). In the second case, the Datadog Agent computes these metrics.

Figure: OpenTelemetry APM metrics computation

Both APM metrics and distributed traces help you monitor application performance. Metrics help you spot increases in latency or error rates for specific resources, while distributed traces let you drill down to individual requests.

Why sampling is useful

The Datadog tracing libraries, the Datadog Agent, the OpenTelemetry SDKs, and the OpenTelemetry Collector all provide sampling capabilities because, for most services, you do not need to ingest 100% of traces to gain visibility into the health of your applications.

Configuring sampling rates before sending traces to Datadog allows you to:

  • Ingest the data that is most relevant to your business and your observability goals.
  • Reduce network costs by avoiding sending unused trace data to the Datadog platform.
  • Control and manage your overall costs.

Reducing your ingestion volume

With OpenTelemetry, you can configure sampling both in the OpenTelemetry libraries and in the OpenTelemetry Collector:

  • Head-based Sampling in the OpenTelemetry SDKs
  • Tail-based Sampling in the OpenTelemetry Collector

SDK-level sampling

At the SDK level, you can implement head-based sampling, in which the sampling decision is made at the beginning of the trace. This type of sampling is particularly useful for high-throughput applications where you know that you do not need visibility into 100% of the traffic to monitor application health. It can also help control the overhead introduced by OpenTelemetry.

TraceIdRatioBased and ParentBased are the SDK’s built-in samplers that allow you to implement deterministic head-based sampling based on the trace_id at the SDK level.
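
As an illustration, many OpenTelemetry SDKs can select these samplers through the standard OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG environment variables. The sketch below assumes a Docker Compose-style deployment. The service name my-service and the 25% ratio are hypothetical, and you should confirm that your SDK honors these variables:

```yaml
services:
  my-service:                # hypothetical service name
    environment:
      # ParentBased sampler wrapping TraceIdRatioBased: root spans are
      # sampled by trace ID ratio, child spans follow the parent's decision.
      OTEL_TRACES_SAMPLER: parentbased_traceidratio
      # Illustrative ratio: keep roughly 25% of new traces.
      OTEL_TRACES_SAMPLER_ARG: "0.25"
```

Wrapping the ratio sampler in a parent-based sampler is a common way to keep every service in a request path aligned with the decision made at the root of the trace.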

With head-based sampling, the APM metrics are computed on the sampled traffic, since only the sampled traffic is sent to the OpenTelemetry Collector or Datadog Agent, which is where the metrics calculation is done.

To get accurate stats, you can upscale the metrics by using formulas and functions in Datadog dashboards and monitors, provided that you know the configured sampling rate in the SDK.
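
As a rough worked example, assume a uniform head-based sampling rate r, so that each kept trace stands in for about 1/r traces. The numbers below are illustrative only:

```latex
\text{estimated hits} \approx \frac{\text{sampled hits}}{r},
\qquad
r = 0.25,\ \text{sampled hits} = 500
\;\Rightarrow\;
\text{estimated hits} \approx \frac{500}{0.25} = 2000
```

Under the same uniform-sampling assumption, ratios such as error rate should not need this correction, because the numerator and the denominator are scaled by the same factor.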

Use the ingestion volume control guide to read more about the implications of setting up trace sampling on trace analytics monitors and metrics from spans.

Collector-level sampling

At the OpenTelemetry Collector level, you can do tail-based sampling, which lets you define more advanced rules to maintain visibility into traces with errors or high latency.

The Tail Sampling Processor and Probabilistic Sampling Processor allow you to sample traces based on a set of rules at the collector level.
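
A minimal sketch of how these two processors might be declared in a Collector configuration follows. The policy names, the 500 ms threshold, and the 10% rates are hypothetical values chosen for illustration, not recommendations:

```yaml
processors:
  # Tail sampling: buffers the spans of each trace for decision_wait,
  # then evaluates the policies against the complete trace.
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors            # hypothetical policy name
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow-traces       # hypothetical policy name
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-remaining       # hypothetical policy name
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

  # Probabilistic sampling: keeps a fixed percentage of traces,
  # without inspecting their content.
  probabilistic_sampler:
    sampling_percentage: 10
```

With policies like these, a trace is generally kept when at least one policy matches, so error and slow traces are retained in addition to a baseline share of the remaining traffic.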

Note: Tail sampling’s main limitation is that all spans for a given trace must be received by the same collector instance for effective sampling decisions. If the trace is distributed across multiple collector instances, there’s a risk that some parts of a trace are dropped whereas some other parts of the same trace are sent to Datadog.

To ensure that APM metrics are computed based on 100% of the applications’ traffic while using collector-level tail-based sampling, use the Datadog Connector.

The Datadog Connector is available starting with v0.83.0. If you are migrating from an older version, read Switch from Datadog Processor to Datadog Connector for OpenTelemetry APM Metrics.
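
One possible way to wire this up is sketched below: an unsampled traces pipeline feeds the Datadog Connector, which computes APM metrics, and a second traces pipeline applies tail sampling before export. The pipeline name traces/sampled, the policy name, the 10% rate, and the DD_API_KEY environment variable are assumptions made for this example:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
  tail_sampling:
    policies:
      - name: sample-10-percent      # hypothetical policy name
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

connectors:
  datadog/connector:

exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}         # assumes the key is set in the environment

service:
  pipelines:
    # All traces reach the connector, so metrics reflect 100% of traffic.
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog/connector]
    # Only traces kept by tail sampling are sent to Datadog.
    traces/sampled:
      receivers: [datadog/connector]
      processors: [tail_sampling, batch]
      exporters: [datadog]
    # APM metrics computed by the connector.
    metrics:
      receivers: [datadog/connector]
      processors: [batch]
      exporters: [datadog]
```

The key design point is ordering: the sampling processor sits after the connector, so metrics are computed before any trace is dropped.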

See the ingestion volume control guide for information about the implications of setting up trace sampling on trace analytics monitors and metrics from spans.

Sampling with the Datadog Agent

When using Datadog Agent OTLP Ingest, a probabilistic sampler is available starting from Agent version 7.44.0. Configure it using the DD_OTLP_CONFIG_TRACES_PROBABILISTIC_SAMPLER_SAMPLING_PERCENTAGE environment variable, or set the following YAML in your Agent’s configuration file:

otlp_config:
  # ...
  traces:
    probabilistic_sampler:
      sampling_percentage: 50

In the above example, 50% of traces are captured.

Note: Probabilistic sampler properties ensure that only complete traces are ingested, assuming you use the same sampling percentage across all Agents.

The probabilistic sampler ignores spans for which the sampling priority is already set at the SDK level. Additionally, spans not caught by the probabilistic sampler might still be captured by the Datadog Agent’s error and rare samplers, ensuring a higher representation of errors and rare endpoint traces in the ingested dataset.

Monitor ingested volumes from the Datadog UI

Use the APM Estimated Usage dashboard and the estimated usage metric datadog.estimated_usage.apm.ingested_bytes to get visibility into your ingested volumes for a specific time period. Filter the dashboard by environment and service to see which services are responsible for the largest shares of the ingested volume.
