Redpanda

Supported OS: Linux, Windows

Integration v1.1.0

Overview

Redpanda is a Kafka API-compatible streaming platform for mission-critical workloads.

Connect Datadog with Redpanda to view key metrics and add metric groups based on specific user needs.

Setup

Installation

  1. Download and launch the Datadog Agent.
  2. Manually install the Redpanda integration. See Use Community Integrations for more details based on the environment.

Host

To install this check for an Agent running on a host, run datadog-agent integration install -t datadog-redpanda==<INTEGRATION_VERSION>.

Containerized

For containerized environments, the best way to use this integration with the Docker Agent is to build the Agent with the Redpanda integration installed.

To build an updated version of the Agent:

  1. Use the following Dockerfile:

    FROM gcr.io/datadoghq/agent:latest

    ARG INTEGRATION_VERSION=1.0.0

    RUN agent integration install -r -t datadog-redpanda==${INTEGRATION_VERSION}

  2. Build the image and push it to your private Docker registry.

  3. Upgrade the Datadog Agent container image. If you are using a Helm chart, modify the agents.image section in the values.yaml file to replace the default Agent image:

    agents:
      enabled: true
      image:
        tag: <NEW_TAG>
        repository: <YOUR_PRIVATE_REPOSITORY>/<AGENT_NAME>

  4. Use the new values.yaml file to upgrade the Agent:

    helm upgrade -f values.yaml <RELEASE_NAME> datadog/datadog

Configuration

Host

Metric collection

To start collecting your Redpanda performance data:

  1. Edit the redpanda.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory. See the sample redpanda.d/conf.yaml.example file for all available configuration options.

  2. Restart the Agent.

Log collection

By default, collecting logs is disabled in the Datadog Agent. Log collection is available for Agent v6.0+.

  1. To enable logs, add the following in your datadog.yaml file:

    logs_enabled: true
    
  2. Make sure the dd-agent user is a member of the systemd-journal group. If it is not, run the following command as root:

    usermod -a -G systemd-journal dd-agent
    
  3. Add the following in your redpanda.d/conf.yaml file to start collecting your Redpanda logs:

     logs:
     - type: journald
       source: redpanda
    

Containerized

Metric collection

For containerized environments, Autodiscovery is configured by default once the Redpanda check is integrated into the Datadog Agent image.

Metrics are then collected automatically and sent to Datadog's servers. For more information, see Autodiscovery Integration Templates.

Log collection

By default, log collection is disabled in the Datadog Agent. Log collection is available for Agent v6.0+.

To enable logs, see Kubernetes Log Collection.

Parameter       Value
<LOG_CONFIG>    {"source": "redpanda", "service": "redpanda_cluster"}
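
On Kubernetes, a log configuration like the one above is usually supplied through an Autodiscovery pod annotation whose value is a JSON list of log configurations. The following is a minimal sketch of how that annotation could be generated; the helper name redpanda_log_annotation and the container name "redpanda" are assumptions made for illustration and should be replaced with the values from your own deployment.

```python
import json


def redpanda_log_annotation(container_name: str) -> dict[str, str]:
    """Build the Autodiscovery log annotation for one container.

    container_name is whatever your pod spec calls the Redpanda container;
    it is not defined by the integration itself.
    """
    log_config = {"source": "redpanda", "service": "redpanda_cluster"}
    # The annotation value is a JSON *list* of log configurations,
    # so the single config is wrapped in a list before serializing.
    key = f"ad.datadoghq.com/{container_name}.logs"
    return {key: json.dumps([log_config])}


if __name__ == "__main__":
    # "redpanda" is a hypothetical container name used only for this example.
    for key, value in redpanda_log_annotation("redpanda").items():
        print(f"{key}: '{value}'")
```

The printed line can be pasted under metadata.annotations in the pod template, after checking the container name against your manifest.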

Validation

Run the Agent’s status subcommand and look for redpanda under the Checks section.

Data Collected

Metrics

redpanda.alien.receive_batch_queue_length
(gauge)
Current receive batch queue length
redpanda.alien.total_received_messages
(count)
Total number of received messages
redpanda.alien.total_sent_messages
(count)
Total number of sent messages
redpanda.application.build
(gauge)
Redpanda build information
redpanda.application.uptime
(gauge)
Redpanda uptime in milliseconds
Shown as millisecond
redpanda.cluster.partition_committed_offset
(gauge)
Partition committed offset, i.e. safely persisted on a majority of replicas
redpanda.cluster.partition_end_offset
(gauge)
Last offset stored by current partition on this node
redpanda.cluster.partition_high_watermark
(gauge)
Partition high watermark, i.e. highest consumable offset
redpanda.cluster.partition_last_stable_offset
(gauge)
Last stable offset
redpanda.cluster.partition_leader
(gauge)
Flag indicating if this partition instance is a leader
redpanda.cluster.partition_leader_id
(gauge)
Id of current partition leader
redpanda.cluster.partition_records_fetched
(count)
Total number of records fetched
Shown as record
redpanda.cluster.partition_records_produced
(count)
Total number of records produced
Shown as record
redpanda.cluster.partition_under_replicated_replicas
(gauge)
Number of under replicated replicas
redpanda.httpd.connections_current
(gauge)
The current number of open connections
Shown as connection
redpanda.httpd.connections_total
(count)
The total number of connections opened
Shown as connection
redpanda.httpd.read_errors
(count)
The total number of errors while reading http requests
Shown as error
redpanda.httpd.reply_errors
(count)
The total number of errors while replying to HTTP requests
Shown as error
redpanda.httpd.requests_served
(count)
The total number of http requests served
Shown as request
redpanda.internal.rpc_active_connections
(gauge)
internal_rpc: Currently active connections
Shown as connection
redpanda.internal.rpc_connection_close_errors
(count)
internal_rpc: Number of errors when shutting down the connection
Shown as connection
redpanda.internal.rpc_connects
(count)
internal_rpc: Number of accepted connections
Shown as connection
redpanda.internal.rpc_consumed_mem_bytes
(count)
internal_rpc: Memory consumed by request processing
Shown as byte
redpanda.internal.rpc_corrupted_headers
(count)
internal_rpc: Number of requests with corrupted headers
redpanda.internal.rpc_dispatch_handler_latency.count
(count)
internal_rpc: Latency
Shown as millisecond
redpanda.internal.rpc_dispatch_handler_latency.sum
(gauge)
internal_rpc: Latency
Shown as millisecond
redpanda.internal.rpc_latency.count
(count)
Internal RPC service latency
Shown as millisecond
redpanda.internal.rpc_latency.sum
(gauge)
Internal RPC service latency
Shown as millisecond
redpanda.internal.rpc_max_service_mem_bytes
(count)
internal_rpc: Maximum memory allowed for RPC
Shown as byte
redpanda.internal.rpc_method_not_found_errors
(count)
internal_rpc: Number of requests for an unavailable RPC method
Shown as error
redpanda.internal.rpc_received_bytes
(count)
internal_rpc: Number of bytes received from the clients in valid requests
Shown as byte
redpanda.internal.rpc_requests_blocked_memory
(count)
internal_rpc: Number of requests blocked in memory backpressure
Shown as request
redpanda.internal.rpc_requests_completed
(count)
internal_rpc: Number of successful requests
Shown as request
redpanda.internal.rpc_requests_pending
(gauge)
internal_rpc: Number of requests being processed by server
Shown as request
redpanda.internal.rpc_sent_bytes
(count)
internal_rpc: Number of bytes sent to clients
Shown as byte
redpanda.internal.rpc_service_errors
(count)
internal_rpc: Number of service errors
Shown as error
redpanda.io.queue_delay
(gauge)
Random delay time in the queue
Shown as second
redpanda.io.queue_disk_queue_length
(gauge)
Number of requests in the disk
redpanda.io.queue_queue_length
(gauge)
Number of requests in the queue
redpanda.io.queue_shares
(gauge)
Current number of shares
redpanda.io.queue_total_bytes
(count)
Total bytes passed in the queue
Shown as byte
redpanda.io.queue_total_delay_sec
(count)
Total time spent in the queue
Shown as second
redpanda.io.queue_total_exec_sec
(count)
Total time spent in disk
Shown as second
redpanda.io.queue_total_operations
(count)
Total operations passed in the queue
Shown as operation
redpanda.kafka.fetch_sessions_cache_mem_usage_bytes
(gauge)
Fetch sessions cache memory usage in bytes
Shown as byte
redpanda.kafka.fetch_sessions_cache_sessions_count
(gauge)
Total number of fetch sessions
redpanda.kafka.latency_fetch_latency_us.count
(count)
Fetch Latency
Shown as millisecond
redpanda.kafka.latency_fetch_latency_us.sum
(gauge)
Fetch Latency
Shown as millisecond
redpanda.kafka.latency_produce_latency_us.count
(count)
Produce Latency
Shown as millisecond
redpanda.kafka.latency_produce_latency_us.sum
(gauge)
Produce Latency
Shown as millisecond
redpanda.kafka.rpc_active_connections
(gauge)
kafka_rpc: Currently active connections
Shown as connection
redpanda.kafka.rpc_connection_close_errors
(count)
kafka_rpc: Number of errors when shutting down the connection
Shown as error
redpanda.kafka.rpc_connects
(count)
kafka_rpc: Number of accepted connections
Shown as connection
redpanda.kafka.rpc_consumed_mem_bytes
(count)
kafka_rpc: Memory consumed by request processing
Shown as byte
redpanda.kafka.rpc_corrupted_headers
(count)
kafka_rpc: Number of requests with corrupted headers
redpanda.kafka.rpc_dispatch_handler_latency.count
(count)
kafka_rpc: Latency
Shown as millisecond
redpanda.kafka.rpc_dispatch_handler_latency.sum
(gauge)
kafka_rpc: Latency
Shown as millisecond
redpanda.kafka.rpc_max_service_mem_bytes
(count)
kafka_rpc: Maximum memory allowed for RPC
Shown as byte
redpanda.kafka.rpc_method_not_found_errors
(count)
kafka_rpc: Number of requests for an unavailable RPC method
Shown as error
redpanda.kafka.rpc_received_bytes
(count)
kafka_rpc: Number of bytes received from the clients in valid requests
Shown as byte
redpanda.kafka.rpc_requests_blocked_memory
(count)
kafka_rpc: Number of requests blocked in memory backpressure
Shown as request
redpanda.kafka.rpc_requests_completed
(count)
kafka_rpc: Number of successful requests
Shown as request
redpanda.kafka.rpc_requests_pending
(gauge)
kafka_rpc: Number of requests being processed by server
Shown as request
redpanda.kafka.rpc_sent_bytes
(count)
kafka_rpc: Number of bytes sent to clients
Shown as byte
redpanda.kafka.rpc_service_errors
(count)
kafka_rpc: Number of service errors
Shown as error
redpanda.kafka.group_offset
(gauge)
Consumer lag offset
redpanda.leader.balancer_leader_transfer_error
(count)
Number of errors attempting to transfer leader
Shown as error
redpanda.leader.balancer_leader_transfer_no_improvement
(count)
Number of times no balance improvement was found
redpanda.leader.balancer_leader_transfer_succeeded
(count)
Number of successful leader transfers
Shown as success
redpanda.leader.balancer_leader_transfer_timeout
(count)
Number of timeouts attempting to transfer leader
Shown as timeout
redpanda.memory.allocated_memory
(count)
Allocated memory size in bytes
Shown as byte
redpanda.memory.cross_cpu_free_operations
(count)
Total number of cross cpu free
Shown as operation
redpanda.memory.free_memory
(count)
Free memory size in bytes
Shown as byte
redpanda.memory.free_operations
(count)
Total number of free operations
Shown as operation
redpanda.memory.malloc_live_objects
(gauge)
Number of live objects
Shown as object
redpanda.memory.malloc_operations
(count)
Total number of malloc operations
Shown as operation
redpanda.memory.reclaims_operations
(count)
Total reclaims operations
Shown as operation
redpanda.memory.total_memory
(count)
Total memory size in bytes
Shown as byte
redpanda.pandaproxy.request_latency.count
(count)
Request latency
Shown as millisecond
redpanda.pandaproxy.request_latency.sum
(gauge)
Request latency
Shown as millisecond
redpanda.raft.done_replicate_requests
(count)
Number of finished replicate requests
Shown as request
redpanda.raft.group_count
(gauge)
Number of raft groups
redpanda.raft.heartbeat_requests_errors
(count)
Number of failed heartbeat requests
Shown as error
redpanda.raft.leader_for
(gauge)
Number of groups for which node is a leader
redpanda.raft.leadership_changes
(count)
Number of leadership changes
redpanda.raft.log_flushes
(count)
Number of log flushes
Shown as flush
redpanda.raft.log_truncations
(count)
Number of log truncations
redpanda.raft.received_append_requests
(count)
Number of append requests received
redpanda.raft.received_vote_requests
(count)
Number of vote requests received
redpanda.raft.recovery_requests_errors
(count)
Number of failed recovery requests
Shown as error
redpanda.raft.replicate_ack_all_requests
(count)
Number of replicate requests with quorum ack consistency
Shown as request
redpanda.raft.replicate_ack_leader_requests
(count)
Number of replicate requests with leader ack consistency
Shown as request
redpanda.raft.replicate_ack_none_requests
(count)
Number of replicate requests with no ack consistency
Shown as request
redpanda.raft.replicate_request_errors
(count)
Number of failed replicate requests
Shown as error
redpanda.raft.sent_vote_requests
(count)
Number of vote requests sent
Shown as request
redpanda.reactor.abandoned_failed_futures
(count)
Total number of abandoned failed futures, i.e. futures destroyed while still containing an exception
redpanda.reactor.aio_bytes_read
(count)
Total aio-reads bytes
Shown as byte
redpanda.reactor.aio_bytes_write
(count)
Total aio-writes bytes
Shown as byte
redpanda.reactor.aio_errors
(count)
Total aio errors
Shown as error
redpanda.reactor.aio_reads
(count)
Total aio-reads operations
Shown as read
redpanda.reactor.aio_writes
(count)
Total aio-writes operations
Shown as write
redpanda.reactor.cpp_exceptions
(count)
Total number of C++ exceptions
Shown as exception
redpanda.reactor.cpu_busy_ms
(count)
Total cpu busy time in milliseconds
Shown as millisecond
redpanda.reactor.cpu_steal_time_ms
(count)
Total steal time: the time in which some other process was running while Seastar was not trying to run (not sleeping). Because this is measured in userspace, some time that could legitimately be thought of as steal time is not accounted as such, for example when we are sleeping and can wake up but the kernel hasn't woken us up yet.
Shown as millisecond
redpanda.reactor.fstream_read_bytes
(count)
Counts bytes read from disk file streams. A high rate indicates high disk activity. Divide by fstream_reads to determine average read size.
Shown as byte
redpanda.reactor.fstream_read_bytes_blocked
(count)
Counts the number of bytes read from disk that could not be satisfied from read-ahead buffers and had to block. Indicates short streams or incorrect read ahead configuration.
Shown as byte
redpanda.reactor.fstream_reads
(count)
Counts reads from disk file streams. A high rate indicates high disk activity. Contrast with other fstream_read* counters to locate bottlenecks.
Shown as read
redpanda.reactor.fstream_reads_ahead_bytes_discarded
(count)
Counts the number of buffered bytes that were read ahead of time and were discarded because they were not needed, wasting disk bandwidth. Indicates over-eager read ahead configuration.
Shown as byte
redpanda.reactor.fstream_reads_aheads_discarded
(count)
Counts the number of times a buffer that was read ahead of time was discarded because it was not needed, wasting disk bandwidth. Indicates over-eager read ahead configuration.
Shown as read
redpanda.reactor.fstream_reads_blocked
(count)
Counts the number of times a disk read could not be satisfied from read-ahead buffers and had to block. Indicates short streams or incorrect read ahead configuration.
Shown as read
redpanda.reactor.fsyncs
(count)
Total number of fsync operations
redpanda.reactor.io_threaded_fallbacks
(count)
Total number of io-threaded-fallbacks operations
Shown as read
redpanda.reactor.logging_failures
(count)
Total number of logging failures
redpanda.reactor.polls
(count)
Number of times pollers were executed
redpanda.reactor.tasks_pending
(gauge)
Number of pending tasks in the queue
redpanda.reactor.tasks_processed
(count)
Total tasks processed
redpanda.reactor.timers_pending
(count)
Number of tasks in the timer-pending queue
redpanda.reactor.utilization
(gauge)
CPU utilization
Shown as percent
redpanda.rpc.client_active_connections
(gauge)
Currently active connections
Shown as connection
redpanda.rpc.client_client_correlation_errors
(count)
Number of errors in client correlation id
Shown as error
redpanda.rpc.client_connection_errors
(count)
Number of connection errors
Shown as connection
redpanda.rpc.client_connects
(count)
Connection attempts
Shown as connection
redpanda.rpc.client_corrupted_headers
(count)
Number of responses with corrupted headers
redpanda.rpc.client_in_bytes
(count)
Total number of bytes sent (including headers)
Shown as byte
redpanda.rpc.client_out_bytes
(count)
Total number of bytes received
Shown as byte
redpanda.rpc.client_read_dispatch_errors
(count)
Number of errors while dispatching responses
Shown as read
redpanda.rpc.client_request_errors
(count)
Number of request errors
Shown as error
redpanda.rpc.client_request_timeouts
(count)
Number of request timeouts
Shown as timeout
redpanda.rpc.client_requests
(count)
Number of requests
Shown as request
redpanda.rpc.client_requests_blocked_memory
(count)
Number of requests that are blocked because of insufficient memory
Shown as request
redpanda.rpc.client_requests_pending
(gauge)
Number of requests pending
Shown as request
redpanda.rpc.client_server_correlation_errors
(count)
Number of responses with wrong correlation id
Shown as error
redpanda.scheduler.queue_length
(gauge)
Size of backlog on this queue in tasks; indicates whether the queue is busy and/or contended
redpanda.scheduler.runtime_ms
(count)
Accumulated runtime of this task queue; an increment rate of 1000ms per second indicates full utilization
Shown as millisecond
redpanda.scheduler.shares
(gauge)
Shares allocated to this queue
redpanda.scheduler.starvetime_ms
(count)
Accumulated starvation time of this task queue; an increment rate of 1000ms per second indicates the scheduler feels really bad
Shown as millisecond
redpanda.scheduler.tasks_processed
(count)
Count of tasks executed on this queue; together with runtime_ms, indicates the length of tasks
Shown as task
redpanda.scheduler.time_spent_on_task_quota_violations_ms
(count)
Total amount in milliseconds we were in violation of the task quota
Shown as millisecond
redpanda.scheduler.waittime_ms
(count)
Accumulated waittime of this task queue; an increment rate of 1000ms per second indicates queue is waiting for something (e.g. IO)
Shown as millisecond
redpanda.stall.detector_reported
(count)
Total number of reported stalls; look in the traces for the exact reason
redpanda.storage.compaction_backlog_controller_backlog_size
(gauge)
Controller backlog
redpanda.storage.compaction_backlog_controller_error
(gauge)
Current controller error, i.e. the difference between set point and backlog size
Shown as error
redpanda.storage.compaction_backlog_controller_shares
(gauge)
Controller output, i.e. number of shares
redpanda.storage.kvstore_cached_bytes
(count)
Size of the database in memory
Shown as byte
redpanda.storage.kvstore_entries_fetched
(count)
Number of entries fetched
Shown as read
redpanda.storage.kvstore_entries_removed
(count)
Number of entries removed
redpanda.storage.kvstore_entries_written
(count)
Number of entries written
Shown as write
redpanda.storage.kvstore_key_count
(count)
Number of keys in the database
redpanda.storage.kvstore_segments_rolled
(count)
Number of segments rolled
redpanda.storage.log_batch_parse_errors
(count)
Number of batch parsing (reading) errors
Shown as error
redpanda.storage.log_batch_write_errors
(count)
Number of batch write errors
Shown as write
redpanda.storage.log_batches_read
(count)
Total number of batches read
Shown as read
redpanda.storage.log_batches_written
(count)
Total number of batches written
Shown as write
redpanda.storage.log_cache_hits
(count)
Reader cache hits
Shown as hit
redpanda.storage.log_cache_misses
(count)
Reader cache misses
Shown as miss
redpanda.storage.log_cached_batches_read
(count)
Total number of cached batches read
Shown as read
redpanda.storage.log_cached_read_bytes
(count)
Total number of cached bytes read
Shown as byte
redpanda.storage.log_compacted_segment
(count)
Number of compacted segments
redpanda.storage.log_compaction_ratio
(count)
Average segment compaction ratio
redpanda.storage.log_corrupted_compaction_indices
(count)
Number of times we had to re-construct the .compaction index on a segment
redpanda.storage.log_log_segments_active
(count)
Number of active log segments
redpanda.storage.log_log_segments_created
(count)
Number of created log segments
redpanda.storage.log_log_segments_removed
(count)
Number of removed log segments
redpanda.storage.log_partition_size
(gauge)
Current size of partition in bytes
Shown as byte
redpanda.storage.log_read_bytes
(count)
Total number of bytes read
Shown as byte
redpanda.storage.log_readers_added
(count)
Number of readers added to cache
Shown as read
redpanda.storage.log_readers_evicted
(count)
Number of readers evicted from cache
Shown as read
redpanda.storage.log_written_bytes
(count)
Total number of bytes written
Shown as byte
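
Several descriptions in the list above imply simple arithmetic: the .sum and .count latency pairs give a mean latency, fstream_read_bytes divided by fstream_reads gives an average read size, and a runtime_ms increment rate of 1000 ms per second means full utilization. The sketch below shows those calculations under the assumption that you already have the change (delta) of each counter over one sampling interval. The function names and the sample numbers are hypothetical and are not part of the integration.

```python
def safe_ratio(numerator: float, denominator: float) -> float:
    """Divide two deltas, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def average_latency(sum_delta: float, count_delta: float) -> float:
    """Mean latency per request from the deltas of a .sum/.count pair.

    The result is in the same unit as the .sum metric.
    """
    return safe_ratio(sum_delta, count_delta)


def average_read_size(read_bytes_delta: float, reads_delta: float) -> float:
    """Average bytes per read: fstream_read_bytes / fstream_reads."""
    return safe_ratio(read_bytes_delta, reads_delta)


def queue_utilization(runtime_ms_delta: float, interval_seconds: float) -> float:
    """Fraction of the interval a task queue spent running.

    1.0 corresponds to runtime_ms growing by 1000 ms per second.
    """
    return safe_ratio(runtime_ms_delta, interval_seconds * 1000.0)


if __name__ == "__main__":
    # Made-up deltas over a 15-second interval, for illustration only.
    print(average_latency(1200.0, 400.0))      # 3.0
    print(average_read_size(1_048_576, 256))   # 4096.0
    print(queue_utilization(7500.0, 15.0))     # 0.5
```

The same ratio applied to starvetime_ms or waittime_ms gives the fraction of the interval a queue spent starved or waiting.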

Events

The Redpanda integration does not include any events.

Service Checks

redpanda.openmetrics.health
Returns CRITICAL if the check cannot access the metrics endpoint. Returns OK otherwise.
Statuses: ok, critical
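
The behavior described for redpanda.openmetrics.health can be pictured as a reachability test against the metrics endpoint. The following is an illustrative sketch only, not the integration's actual implementation: the function names are hypothetical, and the URL http://localhost:9644/metrics is an assumed endpoint that you should replace with the one configured for your instance.

```python
from http.client import HTTPException
from typing import Callable
from urllib.request import urlopen

OK = "ok"
CRITICAL = "critical"


def fetch_metrics(url: str, timeout: float = 5.0) -> bytes:
    """Download the raw metrics payload; raises on any connection failure."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def health_status(url: str, fetch: Callable[[str], bytes] = fetch_metrics) -> str:
    """Return "critical" if the endpoint cannot be accessed, "ok" otherwise."""
    try:
        fetch(url)
    except (OSError, ValueError, HTTPException):
        # URLError, HTTPError, and socket timeouts are all OSError subclasses;
        # ValueError covers malformed URLs.
        return CRITICAL
    return OK


if __name__ == "__main__":
    # Assumed endpoint; adjust host and port to match your deployment.
    print(health_status("http://localhost:9644/metrics"))
```

Passing the fetch function as a parameter keeps the status logic testable without a running Redpanda node.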

Troubleshooting

Need help? Contact Datadog support.