Buffering and Backpressure
Overview
Observability Pipelines are designed for durability and to mitigate the impact when destinations are unavailable. If a destination is unavailable (for example, due to connection issues), the Observability Pipelines Worker retries its connection to the destination until the connection is reestablished or the destination times out. During this time, events accumulate in the pipeline components’ internal buffers, eventually blocking the source from ingesting new events. This behavior is called backpressure.
It is important to consider how backpressure propagates through your architecture in the event of a destination outage. For example, backpressure can block your application from sending logs to Observability Pipelines and those logs may contend for memory or disk resources needed by your service. To prevent backpressure from reaching your application, configure a destination buffer and size it according to your pipeline’s throughput, which helps the Worker absorb backpressure when the destination is unavailable. You may also want to configure the on-full buffer behavior to drop newest, which prevents backpressure by dropping incoming events when the buffer is full. See the destination buffers section for more information on configurable destination buffers.
All components in the Observability Pipelines Worker have an in-memory buffer to help smooth the handoff of events between components. Each source has a buffer with a capacity of 1,000 events per worker thread, and writes events to its downstream buffer upon ingestion. Each processor has an in-memory buffer with a capacity of 100 events, from which the processor consumes events. Source and processor buffers are not configurable.
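As an illustration of how these bounded buffers produce backpressure, the following sketch (not Worker code; the capacities are the defaults documented above, and the one-event-per-step drain is a simplifying assumption) fills a source → processor → destination chain while the destination is unavailable:

```python
from collections import deque

# Illustrative model of bounded pipeline buffers. Capacities mirror the
# documented defaults: 1,000 events per source buffer, 100 per processor
# buffer, and 500 per (default, in-memory) destination buffer.
class BoundedBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def try_push(self, event):
        """Return False (backpressure) when the buffer is full."""
        if len(self.events) >= self.capacity:
            return False
        self.events.append(event)
        return True

    def pop(self):
        return self.events.popleft() if self.events else None

source_buf = BoundedBuffer(1_000)
processor_buf = BoundedBuffer(100)
destination_buf = BoundedBuffer(500)

# With the destination unavailable, nothing drains destination_buf, so the
# buffers fill from back to front and the source eventually blocks ingestion.
def step(ingested):
    if (e := processor_buf.pop()) is not None and not destination_buf.try_push(e):
        processor_buf.events.appendleft(e)   # destination buffer full: push back
    if (e := source_buf.pop()) is not None and not processor_buf.try_push(e):
        source_buf.events.appendleft(e)      # processor buffer full: push back
    return source_buf.try_push(ingested)     # False once backpressure reaches the source

blocked_at = next(i for i in range(10_000) if not step(i))
# The source blocks once all 1,600 buffer slots (500 + 100 + 1,000) are full.
```

In this toy model the source stops accepting events exactly when the combined capacity of the chain is exhausted, which is why sizing the destination buffer to your throughput determines how long an outage the pipeline can absorb.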
Destination buffers
By default, destinations have an in-memory buffer with a capacity of 500 events. This buffer is configurable, enabling you to control these parameters:
- Buffer type: Either an in-memory buffer or a disk buffer
- Buffer size: The maximum byte capacity of the buffer
- Buffer on-full behavior: Determines what happens when the buffer overflows: either block incoming events and propagate backpressure, or drop incoming events to prevent backpressure from propagating.
For each of those settings, choose the option that best aligns with your logging strategy.
Choosing buffer types
In-memory buffers prioritize throughput over durability. They can handle significant bandwidth, but memory buffers do not persist between Worker restarts.
Use an in-memory buffer if you want to prevent backpressure from propagating, and losing all logs in the buffer on restart is acceptable.
Disk buffers prioritize durability over throughput. Disk buffers write to the page cache first, then flush to disk if the destination doesn’t send the data immediately. A disk buffer waits at most 500 ms before calling fsync and flushing a data file to disk, and flushes sooner if a data file reaches its maximum size of 128 MB before 500 ms have elapsed since the last flush. Disk buffers are ordered, meaning events are sent downstream in the order they were written to the buffer (first in, first out).
Use a disk buffer if you need to mitigate data loss and the throughput of your pipeline is unlikely to be bottlenecked by I/O when flushing to disk.
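The flush policy described above can be sketched as a simple predicate (illustrative only; the 500 ms and 128 MB thresholds are the documented values, but this is not Worker code):

```python
# Illustrative sketch of the documented disk-buffer flush policy:
# fsync at most 500 ms after the previous flush, or sooner if the
# current data file reaches its 128 MB maximum size.
FLUSH_INTERVAL_MS = 500
MAX_DATA_FILE_BYTES = 128 * 1024 * 1024

def should_flush(ms_since_last_flush: int, data_file_bytes: int) -> bool:
    return (ms_since_last_flush >= FLUSH_INTERVAL_MS
            or data_file_bytes >= MAX_DATA_FILE_BYTES)

# A half-full data file is flushed once 500 ms has elapsed...
assert should_flush(500, 64 * 1024 * 1024)
# ...a full data file is flushed early, before the interval expires...
assert should_flush(100, 128 * 1024 * 1024)
# ...and a small, recently flushed file is not flushed yet.
assert not should_flush(100, 1024)
```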
Choosing buffer on-full behavior
Block (default): If the buffer is full, incoming events are blocked from being written to the buffer. Use this option if you want to help ensure no events are dropped.
Drop Newest: If the buffer is full, incoming events are dropped. This allows the source to continue to ingest events and prevents backpressure from propagating back to your application. See Using buffers with multiple destinations for details on how this works when you have multiple destinations.
This table compares the differences between the memory and disk buffer.
| Property | Memory Buffer | Disk Buffer |
|---|---|---|
| Default size | 500 events | Configurable. Minimum buffer size: 256 MB; maximum buffer size: 500 GB |
| Performance | Higher | Lower |
| Durability through an unexpected Worker restart or crash | None | Events are flushed to disk at least every 500 ms |
| Data loss due to an unexpected restart or crash | All buffered data is lost | Buffered data that was flushed to disk is retained |
| Data loss on graceful shutdown | All buffered data is lost | None; all data in the pipeline is flushed to disk before exit |
Using buffers with multiple destinations
After your events are sent through your processors, the events go through a fanout to all of your pipeline’s destinations. If backpressure propagates to the fanout from any destination, all destinations are blocked: no additional events are sent to any destination until the blocked destination resumes sending events successfully.
The drop_newest on-full behavior drops incoming events when a destination’s buffer is full. This prevents backpressure from propagating to the fanout from that destination, allowing your other destinations to continue ingesting events from the fanout. This can be helpful if you want to help ensure events are delivered reliably to one destination, but are okay with another destination dropping events if it becomes unavailable to prevent backpressure propagation.
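A minimal sketch of this fanout behavior (illustrative only; the destination names, capacities, and data structures are hypothetical, not Worker internals):

```python
from collections import deque

def fanout(event, dests):
    """Deliver an event to every destination buffer; return False if blocked."""
    # Any full buffer configured to block stalls the whole fanout:
    # no destination receives the event until that buffer drains.
    if any(d["on_full"] == "block" and len(d["buf"]) >= d["capacity"]
           for d in dests.values()):
        return False
    for d in dests.values():
        if len(d["buf"]) < d["capacity"]:
            d["buf"].append(event)
        # else: on_full is "drop_newest", so the event is dropped for this
        # destination only; the other destinations still receive it.
    return True

dests = {
    "archive": {"buf": deque(), "capacity": 2, "on_full": "drop_newest"},
    "datadog": {"buf": deque(), "capacity": 5, "on_full": "block"},
}
accepted = [fanout(i, dests) for i in range(6)]
# The small drop_newest buffer fills after two events and silently drops the
# rest, while the block destination keeps receiving until it is full, at
# which point the entire fanout stalls (the final call returns False).
```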
Kubernetes persistent volumes
If you enable disk buffering for destinations, you must enable Kubernetes persistent volumes in the Observability Pipelines helm chart. With disk buffering enabled, events are first sent to the buffer and written to the persistent volumes, then sent downstream.
Buffer metrics
Use these metrics to analyze buffer performance. All metrics are emitted on a one-second interval, unless otherwise stated.
Source buffer metrics
These metrics are specific to source buffers, located downstream of a source. Each source emits its own respective buffer metrics. Note: Source buffers are not configurable, but these metrics can help monitor backpressure as it propagates to your pipeline’s source.
- Use the `component_id` tag to filter or group by individual components.
- Use the `component_type` tag to filter or group by the source type, such as `splunk_hec` for the Splunk HEC source.
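For example, to graph the fill level of each source buffer per component (a hypothetical query built from the metric and tags on this page; the source type shown is just an illustration):

```
avg:pipelines.source_buffer_utilization_level{component_type:splunk_hec} by {component_id}
```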
`pipelines.source_buffer_utilization`
- Description: Event count in a source’s buffer.
- Metric type: histogram

`pipelines.source_buffer_utilization_level`
- Description: Number of events in a source’s buffer.
- Metric type: gauge

`pipelines.source_buffer_utilization_mean`
- Description: The exponentially weighted moving average (EWMA) of the number of events in the source’s buffer.
- Metric type: gauge

`pipelines.source_buffer_max_size_events`
- Description: A source buffer’s maximum event capacity.
- Metric type: gauge
Processor buffer metrics
These metrics are specific to processor buffers, located upstream of a processor. Each processor emits its own respective buffer metrics. Note: Processor buffers are not configurable, but these metrics can help monitor backpressure as it propagates through your pipeline’s processors.
- Use the `component_id` tag to filter or group by individual components.
- Use the `component_type` tag to filter or group by the processor type, such as `quota` for the Quota processor.
`pipelines.transform_buffer_utilization`
- Description: Event count in a processor’s buffer.
- Metric type: histogram

`pipelines.transform_buffer_utilization_level`
- Description: Event count in a processor’s buffer.
- Metric type: gauge

`pipelines.transform_buffer_utilization_mean`
- Description: The exponentially weighted moving average (EWMA) of the number of events in a processor’s buffer.
- Metric type: gauge

`pipelines.transform_buffer_max_size_events`
- Description: A processor buffer’s maximum event capacity.
- Metric type: gauge
Destination buffer metrics
These metrics are specific to destination buffers, located upstream of a destination. Each destination emits its own respective buffer metrics.
- Use the `component_id` tag to filter or group by individual components.
- Use the `component_type` tag to filter or group by the destination type, such as `datadog_logs` for the Datadog Logs destination.
`pipelines.buffer_size_events`
- Description: Number of events in a destination’s buffer.
- Metric type: gauge

`pipelines.buffer_size_bytes`
- Description: Number of bytes in a destination’s buffer.
- Metric type: gauge

`pipelines.buffer_received_events_total`
- Description: Events received by a destination’s buffer. Note: This metric represents the count per second and not the cumulative total, even though `total` is in the metric name.
- Metric type: counter

`pipelines.buffer_received_bytes_total`
- Description: Bytes received by a destination’s buffer. Note: This metric represents the count per second and not the cumulative total, even though `total` is in the metric name.
- Metric type: counter

`pipelines.buffer_sent_events_total`
- Description: Events sent downstream by a destination’s buffer. Note: This metric represents the count per second and not the cumulative total, even though `total` is in the metric name.
- Metric type: counter

`pipelines.buffer_sent_bytes_total`
- Description: Bytes sent downstream by a destination’s buffer. Note: This metric represents the count per second and not the cumulative total, even though `total` is in the metric name.
- Metric type: counter

`pipelines.buffer_discarded_events_total`
- Description: Events discarded by the buffer. Note: This metric represents the count per second and not the cumulative total, even though `total` is in the metric name.
- Metric type: counter
- Additional tags: `intentional:true` means an incoming event was dropped because the buffer was configured to drop the newest logs when it’s full. `intentional:false` means the event was dropped due to an error.

`pipelines.buffer_discarded_bytes_total`
- Description: Bytes discarded by the buffer. Note: This metric represents the count per second and not the cumulative total, even though `total` is in the metric name.
- Metric type: counter
- Additional tags: `intentional:true` means an incoming event was dropped because the buffer was configured to drop the newest logs when it’s full. `intentional:false` means the event was dropped due to an error.
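For example, to track drops caused by errors rather than by a deliberate drop-newest configuration (a hypothetical query built from the metric and tag above):

```
sum:pipelines.buffer_discarded_events_total{intentional:false} by {component_id}
```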
Deprecated buffer metrics
These metrics are still emitted by the Observability Pipelines Worker for backwards compatibility. Datadog recommends using the replacements when possible.
`pipelines.buffer_events`
- Description: Number of events in a destination’s buffer. Use `pipelines.buffer_size_events` instead.
- Metric type: gauge

`pipelines.buffer_byte_size`
- Description: Number of bytes in a destination’s buffer. Use `pipelines.buffer_size_bytes` instead.
- Metric type: gauge
Further reading
Additional helpful documentation, links, and articles: