
Agent Check: Kafka

Supported OS: Linux, Mac OS, Windows

Overview

Connect Kafka to Datadog in order to:

  • Visualize the performance of your cluster in real time
  • Correlate the performance of Kafka with the rest of your applications

This check has a limit of 350 metrics per instance. The number of returned metrics is indicated on the info page. You can specify the metrics you are interested in by editing the configuration below. To learn how to customize the metrics to collect, see the JMX Checks documentation for more detailed instructions.
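As a sketch of the JMX Checks filter syntax, an include block under conf restricts collection to specific beans and attributes (the bean, attribute, and alias below are illustrative; bean names depend on your Kafka version):

```yaml
init_config:
  is_jmx: true
  conf:
    - include:
        domain: kafka.server
        # Illustrative bean; verify the name against your Kafka version
        bean: kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
        attribute:
          MeanRate:
            metric_type: gauge
            alias: kafka.messages_in.rate
```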

To collect Kafka consumer metrics, see the kafka_consumer check.

Setup

Installation

The Agent’s Kafka check is included in the Datadog Agent package, so you don’t need to install anything else on your Kafka nodes.

The check collects metrics via JMX, so you’ll need a JVM on each Kafka node so the Agent can fork jmxfetch. You can use the same JVM that Kafka uses.

Configuration

Edit the kafka.d/conf.yaml file, in the conf.d/ folder at the root of your Agent’s configuration directory.

Metric Collection

The following instructions are for Datadog Agent >= 5.0. For earlier Agent versions, refer to the older documentation.

Kafka bean names depend on the exact Kafka version you’re running. Always use the example that comes packaged with the Agent as a base, since it is the most up-to-date configuration. Note that the linked sample configuration file may correspond to a newer version of the Agent than the one you have installed.
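As a minimal sketch, an instance entry only needs the host and JMX port of the broker (the port and name below are placeholders for your environment):

```yaml
instances:
  - host: localhost    # Kafka broker host
    port: 9999         # placeholder; set to the JMX remote port your broker exposes
    name: kafka_instance
    # user: <JMX_USER>          # only if JMX authentication is enabled
    # password: <JMX_PASSWORD>
init_config:
  is_jmx: true
```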

After you’ve configured kafka.d/conf.yaml, restart the Agent to begin sending Kafka metrics to Datadog.

Log Collection

Available for Agent >6.0

Kafka uses the log4j logger by default. To activate logging to a file and customize the format, edit the log4j.properties file:

# Set root logger level to INFO and its only appender to R
log4j.rootLogger=INFO, R
log4j.appender.R.File=/var/log/kafka/server.log
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p [%t] %c{1}:%L - %m%n

By default, our integration pipeline supports the following conversion patterns:

  %d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
  %d [%t] %-5p %c - %m%n
  %r [%t] %p %c %x - %m%n

Make sure you clone and edit the integration pipeline if you have a different format.

  • Collecting logs is disabled by default in the Datadog Agent; enable it in your datadog.yaml file with:

    logs_enabled: true

  • Add this configuration block to your kafka.d/conf.yaml file to start collecting your Kafka logs:

    logs:
      - type: file
        path: /var/log/kafka/server.log
        source: kafka
        service: myapp
        # To handle multi-line logs that start with yyyy-mm-dd, use the following pattern:
        # log_processing_rules:
        #   - type: multi_line
        #     name: log_start_with_date
        #     pattern: \d{4}\-(0?[1-9]|1[012])\-(0?[1-9]|[12][0-9]|3[01])

Change the path and service parameter values to match your environment. See the sample kafka.d/conf.yaml for all available configuration options.

Learn more about log collection in the log documentation.

Validation

Run the Agent’s status subcommand and look for kafka under the Checks section.

Data Collected

Metrics

kafka.net.bytes_out
(gauge)
Outgoing byte rate.
shown as byte
kafka.net.bytes_in
(gauge)
Incoming byte rate.
shown as byte
kafka.net.bytes_rejected
(gauge)
Rejected byte rate.
shown as byte
kafka.messages_in
(gauge)
Incoming message rate.
shown as message
kafka.request.fetch.failed
(gauge)
Number of client fetch request failures.
shown as request
kafka.request.fetch.failed_per_second
(gauge)
Rate of client fetch request failures per second.
shown as request
kafka.request.produce.time.avg
(gauge)
Average time for a produce request.
shown as millisecond
kafka.request.produce.time.99percentile
(gauge)
Time for produce requests for 99th percentile.
shown as millisecond
kafka.request.produce.failed_per_second
(gauge)
Rate of failed produce requests per second.
shown as request
kafka.request.produce.failed
(gauge)
Number of failed produce requests.
shown as request
kafka.request.fetch.time.avg
(gauge)
Average time per fetch request.
shown as millisecond
kafka.request.fetch.time.99percentile
(gauge)
Time for fetch requests for 99th percentile.
shown as millisecond
kafka.request.update_metadata.time.avg
(gauge)
Average time for a request to update metadata.
shown as millisecond
kafka.request.update_metadata.time.99percentile
(gauge)
Time for update metadata requests for 99th percentile.
shown as millisecond
kafka.request.metadata.time.avg
(gauge)
Average time for metadata request.
shown as millisecond
kafka.request.metadata.time.99percentile
(gauge)
Time for metadata requests for 99th percentile.
shown as millisecond
kafka.request.offsets.time.avg
(gauge)
Average time for an offset request.
shown as millisecond
kafka.request.offsets.time.99percentile
(gauge)
Time for offset requests for 99th percentile.
shown as millisecond
kafka.request.handler.avg.idle.pct
(gauge)
Average fraction of time the request handler threads are idle.
shown as fraction
kafka.replication.isr_shrinks
(gauge)
Rate of replicas leaving the ISR pool.
shown as node
kafka.replication.isr_expands
(gauge)
Rate of replicas joining the ISR pool.
shown as node
kafka.replication.leader_elections
(gauge)
Leader election rate.
shown as event
kafka.replication.unclean_leader_elections
(gauge)
Unclean leader election rate.
shown as event
kafka.replication.under_replicated_partitions
(gauge)
Number of unreplicated partitions.
kafka.log.flush_rate
(gauge)
Log flush rate.
shown as flush
kafka.consumer.delayed_requests
(gauge)
Number of delayed consumer requests.
shown as request
kafka.consumer.expires_per_second
(gauge)
Rate of delayed consumer request expiration.
shown as eviction
kafka.expires_sec
(gauge)
Rate of delayed producer request expiration.
shown as eviction
kafka.follower.expires_per_second
(gauge)
Rate of request expiration on followers.
shown as eviction
kafka.producer.available_buffer_bytes
(gauge)
The total amount of buffer memory that is not being used (either unallocated or in the free list)
shown as byte
kafka.producer.batch_size_avg
(gauge)
The average number of bytes sent per partition per-request.
shown as byte
kafka.producer.compression_rate_avg
(rate)
The average compression rate of record batches.
shown as record
kafka.producer.bufferpool_wait_time
(gauge)
The fraction of time an appender waits for space allocation.
kafka.producer.compression_rate
(gauge)
The average compression rate of record batches for a topic
shown as record
kafka.producer.delayed_requests
(gauge)
Number of producer requests delayed.
shown as request
kafka.producer.expires_per_seconds
(gauge)
Rate of producer request expiration.
shown as eviction
kafka.producer.batch_size_max
(gauge)
The max number of bytes sent per partition per-request.
shown as byte
kafka.producer.record_send_rate
(gauge)
The average number of records sent per second for a topic
shown as record
kafka.producer.record_retry_rate
(gauge)
The average per-second number of retried record sends for a topic
shown as record
kafka.producer.record_error_rate
(gauge)
The average per-second number of record sends that resulted in errors for a topic
shown as error
kafka.producer.records_per_request
(gauge)
The average number of records sent per second.
shown as record
kafka.producer.record_queue_time_avg
(gauge)
The average time in ms record batches spent in the record accumulator.
shown as millisecond
kafka.producer.record_queue_time_max
(gauge)
The maximum time in ms record batches spent in the record accumulator.
shown as millisecond
kafka.producer.record_size_avg
(gauge)
The average record size.
shown as byte
kafka.producer.record_size_max
(gauge)
The maximum record size.
shown as byte
kafka.producer.request_rate
(gauge)
Number of producer requests per second.
shown as request
kafka.producer.response_rate
(gauge)
Number of producer responses per second.
shown as response
kafka.producer.requests_in_flight
(gauge)
The current number of in-flight requests awaiting a response.
shown as request
kafka.producer.request_latency_avg
(gauge)
Producer average request latency.
shown as millisecond
kafka.producer.request_latency_max
(gauge)
The maximum request latency in ms.
shown as millisecond
kafka.producer.bytes_out
(gauge)
Producer bytes out rate.
shown as byte
kafka.producer.metadata_age
(gauge)
The age in seconds of the current producer metadata being used.
shown as second
kafka.producer.message_rate
(gauge)
Producer message rate.
shown as message
kafka.producer.buffer_bytes_total
(gauge)
The maximum amount of buffer memory the client can use (whether or not it is currently used).
shown as byte
kafka.producer.io_wait
(gauge)
Producer I/O wait time.
shown as nanosecond
kafka.producer.throttle_time_avg
(gauge)
The average time in ms a request was throttled by a broker.
shown as millisecond
kafka.producer.throttle_time_max
(gauge)
The maximum time in ms a request was throttled by a broker.
shown as millisecond
kafka.producer.waiting_threads
(gauge)
The number of user threads blocked waiting for buffer memory to enqueue their records.
shown as thread
kafka.consumer.max_lag
(gauge)
Maximum consumer lag.
shown as offset
kafka.consumer.fetch_rate
(gauge)
The minimum rate at which the consumer sends fetch requests to a broker.
shown as request
kafka.consumer.fetch_size_avg
(gauge)
The average number of bytes fetched per request for a specific topic.
shown as byte
kafka.consumer.fetch_size_max
(gauge)
The maximum number of bytes fetched per request for a specific topic.
shown as byte
kafka.consumer.bytes_consumed
(gauge)
The average number of bytes consumed per second for a specific topic.
shown as byte
kafka.consumer.bytes_in
(gauge)
Consumer bytes in rate.
shown as byte
kafka.consumer.messages_in
(gauge)
Rate of consumer message consumption.
shown as message
kafka.consumer.zookeeper_commits
(gauge)
Rate of offset commits to ZooKeeper.
shown as write
kafka.consumer.kafka_commits
(gauge)
Rate of offset commits to Kafka.
shown as write
kafka.consumer.records_consumed
(gauge)
The average number of records consumed per second for a specific topic.
shown as record
kafka.consumer.records_per_request_avg
(gauge)
The average number of records in each request for a specific topic.
shown as record

Events

The Kafka check does not include any events at this time.

Service Checks

kafka.can_connect

Returns CRITICAL if the Agent is unable to connect to and collect metrics from the monitored Kafka instance. Returns OK otherwise.

Troubleshooting

Further Reading



Agent Check: Kafka Consumer

Overview

This Agent check only collects metrics for message offsets. If you want to collect JMX metrics from the Kafka brokers or Java-based consumers/producers, see the kafka check.

This check fetches the highwater offsets from the Kafka brokers, the consumer offsets stored in Kafka or ZooKeeper (for old-style consumers), and the calculated consumer lag (the difference between the broker offset and the consumer offset).
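The lag calculation is simple arithmetic; as an illustration (the offsets below are hypothetical, not read from a real broker):

```python
def consumer_lag(highwater_offset, consumer_offset):
    """Consumer lag: how many messages the consumer's committed offset
    trails behind the broker's highwater (latest) offset."""
    return highwater_offset - consumer_offset

# Hypothetical offsets: broker highwater at 1500, consumer committed up to 1350
lag = consumer_lag(1500, 1350)
print(lag)  # 150 messages behind
```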

Setup

Installation

The Agent’s Kafka consumer check is included in the Datadog Agent package, so you don’t need to install anything else on your Kafka nodes.

Configuration

Create a kafka_consumer.yaml file using this sample configuration file as an example. Then restart the Datadog Agent to start sending metrics to Datadog.

Validation

Run the Agent’s status subcommand and look for kafka_consumer under the Checks section.

Data Collected

Metrics

kafka.broker_offset
(gauge)
Current message offset on broker.
shown as offset
kafka.consumer_lag
(gauge)
Lag in messages between consumer and broker.
shown as offset
kafka.consumer_offset
(gauge)
Current message offset on consumer.
shown as offset

Events

consumer_lag:

The Datadog Agent emits an event when the value of the consumer_lag metric goes below 0, tagging it with topic, partition, and consumer_group.

Service Checks

The Kafka Consumer check does not include any service checks at this time.

Troubleshooting

Further Reading

