Agent Retry and Buffering Logic


Overview

This guide describes the Datadog Agent’s behavior when it fails to send HTTP requests to the Metrics, Logs, APM, and Processes intake endpoints. All retry strategies use exponential backoff with randomized jitter. See the backoff source code for implementation details.

A failed HTTP request in this guide refers to any request that does not result in a 2xx HTTP response.

Metrics retry strategy

The Agent retries failed HTTP requests using an exponential backoff strategy. The Agent uses the following default retry configurations for the metrics intake:

  • Base backoff time: 2 seconds
  • Maximum backoff time: 64 seconds
  • Maximum backoff time is reached after 6 retries
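The defaults above imply a doubling schedule that starts at 2 seconds and reaches the 64-second cap on the sixth retry. The following is a minimal sketch of that schedule, not the Agent's implementation. The function names and the jitter range (a random wait between half the ceiling and the ceiling) are assumptions made for illustration.

```python
import random

BASE_BACKOFF_S = 2.0   # base backoff time from the list above
MAX_BACKOFF_S = 64.0   # maximum backoff time from the list above


def backoff_ceiling(retry: int, base: float = BASE_BACKOFF_S, cap: float = MAX_BACKOFF_S) -> float:
    """Upper bound of the wait before the given retry (1-indexed)."""
    if retry < 1:
        raise ValueError("retry must be >= 1")
    # Cap the exponent so very large retry counts cannot overflow a float.
    return min(cap, base * 2 ** min(retry - 1, 32))


def backoff_with_jitter(retry: int, rng: random.Random) -> float:
    """Pick a random wait in [ceiling / 2, ceiling]. The jitter range is an assumption."""
    ceiling = backoff_ceiling(retry)
    return rng.uniform(ceiling / 2, ceiling)


schedule = [backoff_ceiling(n) for n in range(1, 9)]
print(schedule)  # [2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 64.0, 64.0]
```

The ceiling doubles on each retry (2, 4, 8, 16, 32, 64), which is why the maximum is reached after 6 retries.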

The Agent retries failed requests for the following scenarios:

  • Network timeouts
  • HTTP 4xx responses (see note for exceptions)
  • HTTP 5xx responses

Note: For 4xx responses, the Agent does not retry requests with status codes 400, 403, or 413. It does retry requests that return a 404 response, because a 404 often indicates a configuration or availability issue that could be resolved.
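The retry rules above can be expressed as a small decision function. This is an illustrative sketch: the function name is hypothetical, and the handling of 1xx and 3xx responses is an assumption because the text does not cover them.

```python
from typing import Optional

# 4xx status codes that the text says are not retried for the metrics intake.
NON_RETRIABLE_4XX = {400, 403, 413}


def should_retry_metrics(status: Optional[int]) -> bool:
    """Return True if the request should be retried.

    status is the HTTP status code, or None for a network timeout.
    """
    if status is None:           # network timeout: no response received
        return True
    if 200 <= status < 300:      # success, nothing to retry
        return False
    if 400 <= status < 500:      # retried unless explicitly excluded
        return status not in NON_RETRIABLE_4XX
    if 500 <= status < 600:
        return True
    return False                 # assumption: 1xx/3xx are treated as non-retriable here


for code in (None, 404, 403, 503):
    print(code, should_retry_metrics(code))
```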

Metrics buffering mechanisms and limits

When the Agent fails to send a metric to the Datadog intake, it compresses and stores the metric in an in-memory retry buffer. See Buffer configurations for the available settings.

The Agent also supports an optional on-disk retry buffer. If you enable this setting, the Agent:

  1. Fills the in-memory buffer until it is full
  2. Evicts older payloads from memory and serializes them to disk
  3. Retries payloads in the following order:
    1. In-memory payloads (newest first)
    2. On-disk payloads (newest first)

This prioritization helps ensure that the Agent sends recent and live metrics before it backfills older data.
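The eviction and retry ordering described above can be modeled with a toy two-tier buffer. This sketch is an assumption-laden illustration: the class name is hypothetical, a Python list stands in for the on-disk store, and the real Agent serializes compressed payloads to files.

```python
from collections import deque
from typing import Deque, List


class RetryBuffer:
    """Toy two-tier retry buffer: memory first, then a list standing in for disk."""

    def __init__(self, max_memory_bytes: int) -> None:
        self.max_memory_bytes = max_memory_bytes
        self.memory: Deque[bytes] = deque()   # oldest on the left, newest on the right
        self.disk: List[bytes] = []           # oldest first; stands in for serialized files
        self.memory_bytes = 0

    def add(self, payload: bytes) -> None:
        """Store a failed payload, evicting the oldest in-memory payloads to disk if needed."""
        self.memory.append(payload)
        self.memory_bytes += len(payload)
        while self.memory_bytes > self.max_memory_bytes and len(self.memory) > 1:
            evicted = self.memory.popleft()
            self.memory_bytes -= len(evicted)
            self.disk.append(evicted)

    def retry_order(self) -> List[bytes]:
        """In-memory payloads newest first, then on-disk payloads newest first."""
        return list(reversed(self.memory)) + list(reversed(self.disk))


buf = RetryBuffer(max_memory_bytes=8)
for payload in (b"p1--", b"p2--", b"p3--", b"p4--"):
    buf.add(payload)
print(buf.retry_order())  # [b'p4--', b'p3--', b'p2--', b'p1--']
```

With an 8-byte memory limit, the two oldest payloads are evicted to the disk stand-in, and the retry order still starts with the most recent data.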

Buffer configurations

The Datadog Agent has the following default configurations for metric retry buffering:

  • On-disk buffer size: 2 GB
  • Maximum disk usage ratio: 0.8
  • Maximum in-memory buffer size: 15 MB

You can change the maximum in-memory buffer size with the forwarder_retry_queue_payloads_max_size setting.

Restart and shutdown behavior

During restart, the Agent:

  • Drops in-memory payloads
  • Preserves and resends on-disk payloads

During shutdown, the Agent:

  • Flushes in-flight requests
  • Does not flush payloads in retry queues (both in-memory and on-disk)

Logs retry strategy

The Logs Agent retries failed HTTP requests indefinitely using an exponential backoff strategy. The Agent uses the following default retry configurations for the logs intake:

  • Base backoff time: 2 seconds
  • Maximum backoff time: 120 seconds

The Agent retries failed log payloads until the logs intake endpoint becomes available.

The Logs Agent does not retry requests with status codes 400, 401, 403, or 413.
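A minimal sketch of an indefinite retry loop with these defaults follows. It is not the Logs Agent's code: the function name is hypothetical, the send and sleep callables are injected so the loop can run without a network or real delays, and jitter is omitted to keep the example deterministic.

```python
from typing import Callable, List

LOGS_BASE_BACKOFF_S = 2.0
LOGS_MAX_BACKOFF_S = 120.0
LOGS_NON_RETRIABLE = {400, 401, 403, 413}


def send_with_retry(send: Callable[[], int], sleep: Callable[[float], None]) -> bool:
    """Call send() until it returns 2xx (True) or a non-retriable status (False)."""
    attempt = 0
    while True:
        status = send()
        if 200 <= status < 300:
            return True
        if status in LOGS_NON_RETRIABLE:
            return False
        # Cap the exponent so an arbitrarily long outage cannot overflow a float.
        wait = min(LOGS_MAX_BACKOFF_S, LOGS_BASE_BACKOFF_S * 2 ** min(attempt, 16))
        sleep(wait)
        attempt += 1


# Simulated intake: fails with 503 seven times, then accepts the payload.
responses = iter([503] * 7 + [200])
waits: List[float] = []
delivered = send_with_retry(lambda: next(responses), waits.append)
print(delivered, waits)  # True [2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 120.0]
```

Because there is no retry limit, the loop ends only when the intake accepts the payload or returns one of the non-retriable status codes.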

Logs buffering mechanisms and limits

Backpressure and consumption

The Logs Agent guarantees log delivery during transmission. When a payload fails to send, the Agent applies backpressure: it stops reading from the log source, then resumes from the last known position when the intake becomes available.

Data loss scenarios

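Logs can still be lost if the source file is rotated or removed while the Agent is waiting for the intake to recover:

  • Kubernetes: Log files may rotate before intake recovery
  • Host-based systems: Files may be removed by tools such as logrotate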

Log buffer limits

  • HTTP logs:
    • Not configurable
  • TCP logs:
    • Buffer limit: 100 log lines
    • The Agent sends logs line by line

Registry and restart behavior

The Logs Agent maintains a registry that tracks log sources and current read offsets. The Agent flushes the registry to disk every second and reloads it when the Agent restarts. You cannot configure this process.

On restart, the Agent resumes reading from the position recorded in the registry.
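The flush-and-reload cycle can be illustrated with a toy offset registry. Everything here is an assumption for illustration: the class name, the JSON file format, the file path, and the atomic-rename write are not taken from the Agent.

```python
import json
import os
import tempfile
from typing import Dict


class OffsetRegistry:
    """Toy registry mapping a log source identifier to the last read offset."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.offsets: Dict[str, int] = {}

    def load(self) -> None:
        """Reload offsets written by a previous run, if any."""
        if os.path.exists(self.path):
            with open(self.path, "r", encoding="utf-8") as fh:
                self.offsets = json.load(fh)

    def flush(self) -> None:
        """Write to a temporary file, then rename it over the registry file."""
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump(self.offsets, fh)
        os.replace(tmp_path, self.path)


registry_path = os.path.join(tempfile.mkdtemp(), "registry.json")

first_run = OffsetRegistry(registry_path)
first_run.offsets["/var/log/app.log"] = 4096   # hypothetical source and offset
first_run.flush()

# Simulate a restart: a fresh object reads the offsets back from disk.
after_restart = OffsetRegistry(registry_path)
after_restart.load()
print(after_restart.offsets["/var/log/app.log"])  # 4096
```

In this model, a periodic flush bounds how far the persisted offset can lag behind the offset the reader has actually reached.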

Advanced shipping configuration

Dual shipping

When you enable dual shipping:

  • The Agent sends logs to the first available endpoint
  • The Agent drops payloads for any endpoint that fails
  • Log consumption continues as long as at least one endpoint succeeds

For details on how the Agent behaves when is_reliable is enabled, see Logs Dual Shipping.

APM retry strategy

The Agent retries failed APM requests using an exponential backoff strategy. The Agent uses the following default retry configurations for the APM intake:

  • Base backoff time: 2 seconds
  • Maximum backoff time: 10 seconds

The Agent retries failed requests for the following scenarios:

  • Network connectivity errors
  • HTTP 408 responses
  • HTTP 5xx responses

You cannot configure the retry behavior or the retriable status codes for APM.
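The APM retry scenarios above reduce to a short predicate. This is a sketch with a hypothetical function name. It represents a network connectivity error as None, which is a modeling choice rather than anything from the Agent.

```python
from typing import Optional


def should_retry_apm(status: Optional[int]) -> bool:
    """status is the HTTP status code, or None for a network connectivity error."""
    if status is None:
        return True
    return status == 408 or 500 <= status < 600


for code in (None, 408, 429, 502):
    print(code, should_retry_apm(code))
```

Unlike the metrics intake, where most 4xx responses are retried, the only 4xx status listed as retriable for APM is 408.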

APM buffering mechanisms and limits

In-memory queues

The Agent compresses failed APM payloads and stores them in in-memory queues. When a queue is full, the Agent drops payloads.

Stats

  • Configurable using apm_config.stats_writer.queue_size
  • Default calculation:
    • int(max(1, max memory / payload size))
    • Example: int(max(1, (250 * 1024 * 1024) / 1500000)) = 174 payloads
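The default calculation can be checked directly. The function name below is hypothetical. The input values are the ones from the example above: 250 MiB of memory and a payload size of 1,500,000 bytes.

```python
def default_stats_queue_size(max_memory_bytes: int, payload_size_bytes: int) -> int:
    """int(max(1, max memory / payload size)), as given in the default calculation."""
    return int(max(1, max_memory_bytes / payload_size_bytes))


print(default_stats_queue_size(250 * 1024 * 1024, 1_500_000))  # 174
print(default_stats_queue_size(1_000_000, 1_500_000))          # 1
```

The max(1, ...) term ensures the queue holds at least one payload even when the memory limit is smaller than a single payload.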

Advanced shipping configuration

Dual shipping

When you enable dual shipping for the APM intake, each endpoint has an independent sender and queue.

Processes retry strategy

The Agent retries failed Processes requests using an exponential backoff strategy. The Agent uses the same default retry configurations as the metrics intake:

  • Base backoff time: 2 seconds
  • Maximum backoff time: 64 seconds
  • Maximum backoff time is reached after 6 retries

See Metrics retry strategy for complete details on retry scenarios and exceptions.

On-disk buffering is not supported for Processes.

Processes buffering mechanisms and limits

The Process Agent uses the metrics forwarder for downstream delivery. Before forwarding check results, the Process Agent stores them in an in-memory queue.

Queue mechanism

The in-memory queue buffers data when the intake is unavailable or during transmission delays.

Buffer limits

  • Queue size: 256 payloads (DefaultProcessQueueSize)
  • Queue memory: 60 MB (DefaultProcessQueueBytes)

With checks running every 10 seconds, these settings buffer approximately 30 minutes of process data.

Version-specific queue behavior

Agent versions 7.38 and earlier:

  • Process and Connections (NPM) payloads share a single queue
  • Buffer limits apply to the combined payloads
  • Buffers approximately 30 minutes of combined data

Agent versions 7.39 and later:

  • Process and Connections (NPM) payloads use separate queues
  • Each payload type has independent buffer limits
  • Default settings buffer approximately 40 minutes of process data
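A rough estimate shows where a figure of about 40 minutes can come from. This sketch assumes one payload per check run and assumes the payload count limit, not the byte limit, is reached first. Both are assumptions, and the function name is hypothetical.

```python
def buffered_minutes(queue_size_payloads: int, check_interval_s: float, payloads_per_check: int = 1) -> float:
    """Minutes of data a queue can hold before it fills, under the stated assumptions."""
    if payloads_per_check < 1:
        raise ValueError("payloads_per_check must be >= 1")
    return queue_size_payloads / payloads_per_check * check_interval_s / 60


print(round(buffered_minutes(256, 10), 1))  # 42.7
```

If a check run produces more than one payload, or if payloads are large enough to reach the memory limit first, the queue fills sooner than this estimate.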