Trace Sampling and Storage

Trace sampling

Trace sampling applies to high-volume, web-scale applications: only a sampled proportion of traces is kept in Datadog, based on the rules below.

Statistics (requests, errors, latency, etc.) are calculated from the full volume of traces at the Agent level, and are therefore always accurate.

Statistics (Requests, Errors, Latencies etc.)

Datadog APM computes the following aggregate statistics over all instrumented traces, regardless of sampling:

  • Total requests and requests per second
  • Total errors and errors per second
  • Latency
  • Breakdown of time spent by service/type
  • Apdex score (web services only)
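As a rough illustration of why these statistics stay accurate, the sketch below computes aggregates over every trace before any sampling takes place. The trace dictionaries, the `aggregate_stats` helper, and the 10-second window are hypothetical and are not part of the Agent; the only point is that the counts come from the full volume, not from the kept subset.

```python
def aggregate_stats(traces, window_seconds):
    """Compute request, error, and latency aggregates over ALL traces."""
    total = len(traces)
    errors = sum(1 for t in traces if t["is_error"])
    total_duration = sum(t["duration_ms"] for t in traces)
    return {
        "requests": total,
        "requests_per_second": total / window_seconds,
        "errors": errors,
        "errors_per_second": errors / window_seconds,
        "avg_latency_ms": total_duration / total if total else 0.0,
    }


# Hypothetical traffic: 1000 traces observed during a 10-second window.
traces = [{"is_error": i % 50 == 0, "duration_ms": 10 + i % 5} for i in range(1000)]

# Statistics are taken first, over the full volume.
stats = aggregate_stats(traces, window_seconds=10)

# Sampling happens afterwards and does not change the statistics.
sampled = traces[::100]

print(stats["requests"], stats["errors"], len(sampled))
```

In this model the statistics describe all 1000 requests even though only 10 traces are kept for inspection.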

Goal of Sampling

The goal of sampling is to keep the traces that matter the most:

  • Distributed traces
  • Low-QPS services
  • A representative variety of traces

Sampling Rules

Over the lifecycle of a trace, sampling decisions are made at the tracing client, Agent, and backend levels, in the following order.

  1. Tracing Client - The tracing client adds a context attribute, sampling.priority, to traces. The attribute travels in language-agnostic request headers, so a single trace can be propagated across a distributed architecture. The sampling.priority attribute is a hint that tells the Datadog Agent to do its best to keep the trace or to drop an unimportant one.

    Value        Type                         Action
    MANUAL_DROP  User input                   The Agent drops the trace.
    AUTO_DROP    Automatic sampling decision  The Agent drops the trace.
    AUTO_KEEP    Automatic sampling decision  The Agent keeps the trace.
    MANUAL_KEEP  User input                   The Agent keeps the trace, and the backend will only apply sampling if above maximum volume allowed.

    Traces are automatically assigned a priority of AUTO_DROP or AUTO_KEEP, in a proportion chosen so that the Agent does not have to sample more than it is allowed to. Users can manually adjust this attribute to prioritize specific types of traces, or to drop uninteresting ones entirely.

  2. Trace Agent (Host or Container Level) - The Agent receives traces from various tracing clients and filters requests based on two rules:

    • Ensure that a variety of traces is kept (across services, resources, HTTP status codes, and errors).
    • Ensure that traces are kept for low-volume resources (web endpoints, DB queries).

    The Agent computes a signature for every trace reported, based on its services, resources, errors, etc. Traces with the same signature are considered similar. For example, signatures could be:

    • env=prod, my_web_service, is_error=true, resource=/login
    • env=staging, my_database_service, is_error=false, query=SELECT...

    A proportion of traces with each signature is then kept, so you get full visibility into all the different kinds of traces happening in your system. This method ensures traces for resources with low volumes are still kept.

    In addition, the Agent provides a service-based sampling rate for the prioritized traces coming from the tracing client, so that traces from low-QPS services are prioritized to be kept.

    Users can manually drop entire uninteresting resource endpoints at the Agent level by using resource filtering.

  3. DD Backend/Server - The server receives traces from the Agents running on hosts and applies sampling to ensure representation from every reporting Agent. It does so by keeping traces based on the signature assigned by the Agent.
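The signature-based rule in step 2 can be pictured with a small sketch. This is a simplified model, not the Agent's actual algorithm: the `signature` and `sample_by_signature` helpers, the dictionary fields, and the keep-one-in-ten rate are all assumptions made for illustration.

```python
from collections import defaultdict


def signature(trace):
    """Hypothetical signature: traces sharing these fields count as similar."""
    return (trace["env"], trace["service"], trace["resource"], trace["is_error"])


def sample_by_signature(traces, keep_every=10):
    """Keep the first trace of each signature, then one in every keep_every."""
    seen = defaultdict(int)
    kept = []
    for trace in traces:
        sig = signature(trace)
        if seen[sig] % keep_every == 0:
            kept.append(trace)
        seen[sig] += 1
    return kept


# 100 successful logins and only 2 failed ones.
ok = {"env": "prod", "service": "my_web_service", "resource": "/login", "is_error": False}
failed = {"env": "prod", "service": "my_web_service", "resource": "/login", "is_error": True}
traces = [ok] * 100 + [failed] * 2

kept = sample_by_signature(traces)
print(len(kept), sum(1 for t in kept if t["is_error"]))
```

Because the proportion is applied per signature instead of over the whole stream, the rare failed-login signature still contributes a kept trace.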

Manually Control Trace Priority

APM enables distributed tracing by default, so that traces are propagated through tracing headers across multiple services/hosts. Tracing headers include a priority tag to ensure complete traces between upstream and downstream services during trace propagation. You can override this tag to manually keep a trace (critical transaction, debug mode, etc.) or drop a trace (health checks, static assets, etc.).

Manually keep a trace (Java):

import datadog.trace.api.DDTags;
import datadog.trace.api.interceptor.MutableSpan;
import datadog.trace.api.Trace;
import io.opentracing.util.GlobalTracer;

public class MyClass {
    @Trace
    public static void myMethod() {
        // grab the active span out of the traced method
        MutableSpan ddspan = (MutableSpan) GlobalTracer.get().activeSpan();
        // Always keep the trace
        ddspan.setTag(DDTags.MANUAL_KEEP, true);
        // method impl follows
    }
}

Manually drop a trace (Java):

import datadog.trace.api.DDTags;
import datadog.trace.api.interceptor.MutableSpan;
import datadog.trace.api.Trace;
import io.opentracing.util.GlobalTracer;

public class MyClass {
    @Trace
    public static void myMethod() {
        // grab the active span out of the traced method
        MutableSpan ddspan = (MutableSpan) GlobalTracer.get().activeSpan();
        // Always Drop the trace
        ddspan.setTag(DDTags.MANUAL_DROP, true);
        // method impl follows
    }
}

Manually keep a trace (Python):

from ddtrace import tracer
from ddtrace.constants import MANUAL_KEEP_KEY

@tracer.wrap()
def handler():
    span = tracer.current_span()
    # Always keep the trace
    span.set_tag(MANUAL_KEEP_KEY)
    # method impl follows

Manually drop a trace (Python):

from ddtrace import tracer
from ddtrace.constants import MANUAL_DROP_KEY

@tracer.wrap()
def handler():
    span = tracer.current_span()
    # Always drop the trace
    span.set_tag(MANUAL_DROP_KEY)
    # method impl follows

Manually keep a trace (Ruby):

Datadog.tracer.trace(name, options) do |span|

  # Always Keep the Trace
  span.set_tag(Datadog::Ext::ManualTracing::TAG_KEEP, true)
  # method impl follows
end

Manually drop a trace (Ruby):

Datadog.tracer.trace(name, options) do |span|
  # Always Drop the Trace
  span.set_tag(Datadog::Ext::ManualTracing::TAG_DROP, true)
  # method impl follows
end

Manually keep a trace (Go):

package main

import (
    "net/http"

    "gopkg.in/DataDog/dd-trace-go.v1/ddtrace/ext"
    "gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

func handler(w http.ResponseWriter, r *http.Request) {
    // Create a span for a web request at the /posts URL.
    span := tracer.StartSpan("web.request", tracer.ResourceName("/posts"))
    defer span.Finish()

    // Always keep this trace:
    span.SetTag(ext.ManualKeep, true)
    // method impl follows
}

Manually drop a trace (Go):

package main

import (
    "net/http"

    "gopkg.in/DataDog/dd-trace-go.v1/ddtrace/ext"
    "gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

func handler(w http.ResponseWriter, r *http.Request) {
    // Create a span for a web request at the /posts URL.
    span := tracer.StartSpan("web.request", tracer.ResourceName("/posts"))
    defer span.Finish()

    // Always drop this trace:
    span.SetTag(ext.ManualDrop, true)
    // method impl follows
}

Manually keep a trace (Node.js):

const tracer = require('dd-trace')
const tags = require('dd-trace/ext/tags')

const span = tracer.startSpan('web.request')

// Always keep the trace
span.setTag(tags.MANUAL_KEEP)
//method impl follows

Manually drop a trace (Node.js):

const tracer = require('dd-trace')
const tags = require('dd-trace/ext/tags')

const span = tracer.startSpan('web.request')

// Always drop the trace
span.setTag(tags.MANUAL_DROP)
//method impl follows

Manually keep a trace (.NET):

using Datadog.Trace;

using(var scope = Tracer.Instance.StartActive(operationName))
{
    var span = scope.Span;

    // Always keep this trace
    span.SetTag(Tags.ManualKeep, "true");
    //method impl follows
}

Manually drop a trace (.NET):

using Datadog.Trace;

using(var scope = Tracer.Instance.StartActive(operationName))
{
    var span = scope.Span;

    // Always drop this trace
    span.SetTag(Tags.ManualDrop, "true");
    //method impl follows
}

Manually keep a trace (PHP):

<?php
  $tracer = \OpenTracing\GlobalTracer::get();
  $span = $tracer->getActiveSpan();

  if (null !== $span) {
    // Always keep this trace
    $span->setTag(\DDTrace\Tag::MANUAL_KEEP, true);
    //method impl follows
  }
?>

Manually drop a trace (PHP):

<?php
  $tracer = \OpenTracing\GlobalTracer::get();
  $span = $tracer->getActiveSpan();

  if (null !== $span) {
    // Always drop this trace
    $span->setTag(\DDTrace\Tag::MANUAL_DROP, true);
    //method impl follows
  }
?>

Manually keep a trace (C++):

...
#include <datadog/tags.h>
...

auto tracer = ...
auto span = tracer->StartSpan("operation_name");
// Always keep this trace
span->SetTag(datadog::tags::manual_keep, {});
//method impl follows

Manually drop a trace (C++):

...
#include <datadog/tags.h>
...

auto tracer = ...
auto another_span = tracer->StartSpan("operation_name");
// Always drop this trace
another_span->SetTag(datadog::tags::manual_drop, {});
//method impl follows

Note that trace priority should be manually controlled only before any context propagation. If it is changed after a context has been propagated, the system can’t ensure that the entire trace is kept across services. Manually controlled trace priority is set at the tracing client; the trace can still be dropped by the Agent or the server based on the sampling rules.
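The ordering constraint can be sketched with a toy model of header propagation. Everything here is hypothetical: the `Priority` numeric values, the `inject` and `extract` helpers, and the `x-sampling-priority` header name are assumptions for illustration and are not the tracing clients' API.

```python
from enum import IntEnum


class Priority(IntEnum):
    # Numeric values are an assumption; the text above only names the priorities.
    MANUAL_DROP = -1
    AUTO_DROP = 0
    AUTO_KEEP = 1
    MANUAL_KEEP = 2


def inject(priority):
    """Copy the current priority into outgoing request headers."""
    return {"x-sampling-priority": str(int(priority))}


def extract(headers):
    """Read the priority that a downstream service would see."""
    return Priority(int(headers["x-sampling-priority"]))


# Case 1: override BEFORE propagation. Downstream sees the manual decision.
priority = Priority.AUTO_DROP
priority = Priority.MANUAL_KEEP
early = extract(inject(priority))

# Case 2: override AFTER propagation. The headers already carry the old value.
priority = Priority.AUTO_DROP
sent_headers = inject(priority)
priority = Priority.MANUAL_KEEP
late = extract(sent_headers)

print(early.name, late.name)
```

In the second case the upstream service keeps its part of the trace while the downstream service still acts on the automatic decision, which is how a trace ends up incomplete.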

Trace Storage

Individual traces are stored for up to 6 months. To determine how long a particular trace will be stored, the Agent makes a sampling decision early in the trace’s lifetime. In the Datadog backend, sampled traces are retained according to time buckets:

Retention bucket        % of stream kept
6 hours                 100%
Current day (UTC time)  25%
6 days                  10%
6 months                1%

Note: Datadog does not sample Synthetics APM traces. All received traces are stored for 6 hours, and then retained over time at the percentages stated above.

That is to say, on a given day you would see in the UI:

  • 100% of sampled traces from the last six hours
  • 25% of those from the earlier hours of the current calendar day (starting at 00:00 UTC)
  • 10% from the previous six calendar days
  • 1% of those from the previous six months (starting from the first day of the month six months ago)
  • 0% of traces older than six months

For example, at 9:00 am UTC on Wed 12/20, you would see:

  • 100% of traces sampled on Wed 12/20 03:00 - 09:00
  • 25% of traces sampled on Wed 12/20 00:00 - Wed 12/20 02:59
  • 10% of traces sampled on Thu 12/14 00:00 - Tue 12/19 23:59
  • 1% of traces sampled on 7/1 00:00 - 12/13 23:59
  • 0% of traces before 7/1 00:00
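The bucket arithmetic in this example can be reproduced with a short sketch. The `retention_percent` helper is hypothetical, the year 2017 is an assumption (chosen so that 12/20 falls on a Wednesday), and the six-month cutoff is computed so that it matches the 7/1 boundary used in the example above.

```python
from datetime import datetime, timedelta, timezone


def retention_percent(trace_time, now):
    """Return the percentage of sampled traces kept for trace_time, as seen at now (UTC)."""
    if trace_time > now:
        raise ValueError("trace_time is in the future")
    if trace_time >= now - timedelta(hours=6):
        return 100  # last six hours
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if trace_time >= midnight:
        return 25  # earlier hours of the current UTC day
    if trace_time >= midnight - timedelta(days=6):
        return 10  # previous six calendar days
    # Assumption: the cutoff is the first day of the month five month
    # boundaries back, which reproduces the 7/1 cutoff in the 12/20 example.
    months = now.year * 12 + (now.month - 1) - 5
    cutoff = datetime(months // 12, months % 12 + 1, 1, tzinfo=timezone.utc)
    return 1 if trace_time >= cutoff else 0


now = datetime(2017, 12, 20, 9, 0, tzinfo=timezone.utc)  # the year is an assumption
for month, day, hour in [(12, 20, 5), (12, 20, 2), (12, 15, 12), (12, 13, 23), (6, 30, 12)]:
    trace_time = datetime(2017, month, day, hour, tzinfo=timezone.utc)
    print(trace_time.isoformat(), retention_percent(trace_time, now))
```

The checks run from the newest bucket to the oldest, so a trace always receives the most generous percentage that applies to its age.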

Once a trace has been viewed by opening a full page, it continues to be available by using its trace ID in the URL: https://app.datadoghq.com/apm/trace/<TRACE_ID>. This is true even if it “expires” from the UI. This behavior is independent of the UI retention time buckets.

Further Reading