Datadog-Apache Kafka Integration

Overview

Connect Kafka to Datadog in order to:

  • Visualize the performance of your cluster in real time
  • Correlate the performance of Kafka with the rest of your applications

This check has a limit of 350 metrics per instance. The number of metrics actually returned is indicated on the Agent's info page. You can specify the metrics you are interested in by editing the check's configuration; for detailed instructions on customizing the metrics to collect, see the JMX Checks documentation.
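For instance, a filter in kafka.yaml can narrow collection down to specific beans and attributes. The bean, attribute, and alias names below are illustrative; use the ones listed in the sample config shipped with your Agent:

```yaml
# Illustrative kafka.yaml fragment -- bean/attribute names must match
# those in the sample config packaged with your Agent version.
instances:
  - host: localhost
    port: 9999
    conf:
      - include:
          domain: 'kafka.server'
          bean: 'kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec'
          attribute:
            MeanRate:
              metric_type: gauge
              alias: kafka.messages_in
```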

To collect Kafka consumer metrics, see the kafka_consumer check.

Setup

Installation

The Agent’s Kafka check is packaged with the Agent, so simply install the Agent on your Kafka nodes.

The check collects metrics via JMX, so each Kafka node needs a JVM that the Agent can use to fork jmxfetch. The same JVM that Kafka uses works fine.

Configuration

The following instructions apply to Datadog Agent >= 5.0. For earlier Agent versions, refer to the older documentation.

Configure a kafka.yaml in the Datadog Agent’s conf.d directory. Kafka bean names depend on the exact Kafka version you’re running, so always start from the example packaged with your Agent, which is the most up-to-date configuration for it. This sample conf file can serve as a reference, but note that it may target a newer Agent version than the one you have installed.

After you’ve configured kafka.yaml, restart the Agent to begin sending Kafka metrics to Datadog.

Validation

Run the Agent’s info subcommand and look for kafka under the Checks section:

  Checks
  ======
    [...]

    kafka-localhost-9999
    --------------------
      - instance #0 [OK]
      - Collected 8 metrics, 0 events & 0 service checks

    [...]

Compatibility

The kafka check is compatible with all major platforms.

Data Collected

Metrics

kafka.net.bytes_out
(gauge)
Outgoing byte rate.
shown as byte
kafka.net.bytes_in
(gauge)
Incoming byte rate.
shown as byte
kafka.net.bytes_rejected
(gauge)
Rejected byte rate.
shown as byte
kafka.messages_in
(gauge)
Incoming message rate.
shown as message
kafka.request.fetch.failed
(gauge)
Number of client fetch request failures.
shown as request
kafka.request.fetch.failed_per_second
(gauge)
Rate of client fetch request failures per second.
shown as request
kafka.request.produce.time.avg
(gauge)
Average time for a produce request.
shown as millisecond
kafka.request.produce.time.99percentile
(gauge)
Time for produce requests for 99th percentile.
shown as millisecond
kafka.request.produce.failed_per_second
(gauge)
Rate of failed produce requests per second.
shown as request
kafka.request.produce.failed
(gauge)
Number of failed produce requests.
shown as request
kafka.request.fetch.time.avg
(gauge)
Average time per fetch request.
shown as millisecond
kafka.request.fetch.time.99percentile
(gauge)
Time for fetch requests for 99th percentile.
shown as millisecond
kafka.request.update_metadata.time.avg
(gauge)
Average time for a request to update metadata.
shown as millisecond
kafka.request.update_metadata.time.99percentile
(gauge)
Time for update metadata requests for 99th percentile.
shown as millisecond
kafka.request.metadata.time.avg
(gauge)
Average time for metadata request.
shown as millisecond
kafka.request.metadata.time.99percentile
(gauge)
Time for metadata requests for 99th percentile.
shown as millisecond
kafka.request.offsets.time.avg
(gauge)
Average time for an offset request.
shown as millisecond
kafka.request.offsets.time.99percentile
(gauge)
Time for offset requests for 99th percentile.
shown as millisecond
kafka.request.handler.avg.idle.pct
(gauge)
Average fraction of time the request handler threads are idle.
shown as fraction
kafka.replication.isr_shrinks
(gauge)
Rate of replicas leaving the ISR pool.
shown as node
kafka.replication.isr_expands
(gauge)
Rate of replicas joining the ISR pool.
shown as node
kafka.replication.leader_elections
(gauge)
Leader election rate.
shown as event
kafka.replication.unclean_leader_elections
(gauge)
Unclean leader election rate.
shown as event
kafka.replication.under_replicated_partitions
(gauge)
Number of under-replicated partitions.
kafka.log.flush_rate
(gauge)
Log flush rate.
shown as flush
kafka.consumer.delayed_requests
(gauge)
Number of delayed consumer requests.
shown as request
kafka.consumer.expires_per_second
(gauge)
Rate of delayed consumer request expiration.
shown as eviction
kafka.expires_sec
(gauge)
Rate of delayed producer request expiration.
shown as eviction
kafka.follower.expires_per_second
(gauge)
Rate of request expiration on followers.
shown as eviction
kafka.producer.delayed_requests
(gauge)
Number of producer requests delayed.
shown as request
kafka.producer.expires_per_seconds
(gauge)
Rate of producer request expiration.
shown as eviction
kafka.producer.request_rate
(gauge)
Number of producer requests per second.
shown as request
kafka.producer.response_rate
(gauge)
Number of producer responses per second.
shown as response
kafka.producer.request_latency_avg
(gauge)
Producer average request latency.
shown as millisecond
kafka.producer.bytes_out
(gauge)
Producer bytes out rate.
shown as byte
kafka.producer.message_rate
(gauge)
Producer message rate.
shown as message
kafka.producer.io_wait
(gauge)
Producer I/O wait time.
shown as nanosecond
kafka.consumer.max_lag
(gauge)
Maximum consumer lag.
shown as offset
kafka.consumer.fetch_rate
(gauge)
The minimum rate at which the consumer sends fetch requests to a broker.
shown as request
kafka.consumer.bytes_in
(gauge)
Consumer bytes in rate.
shown as byte
kafka.consumer.messages_in
(gauge)
Rate of consumer message consumption.
shown as message
kafka.consumer.zookeeper_commits
(gauge)
Rate of offset commits to ZooKeeper.
shown as write
kafka.consumer.kafka_commits
(gauge)
Rate of offset commits to Kafka.
shown as write

Events

The Kafka check does not include any events at this time.

Service Checks

The Kafka check does not include any service checks at this time.

Troubleshooting

Agent failed to retrieve RMIServer stub

instance #kafka-localhost-<PORT_NUM> [ERROR]: 'Cannot connect to instance localhost:<PORT_NUM>. java.io.IOException: Failed to retrieve RMIServer stub

The Datadog Agent is unable to connect to the Kafka instance to retrieve metrics from the exposed mBeans over the RMI protocol.

Include the following JVM arguments when starting the Kafka instance to resolve this issue (required for the Producer, Consumer, and Broker, since they are all separate Java instances):

-Dcom.sun.management.jmxremote.port=<PORT_NUM> -Dcom.sun.management.jmxremote.rmi.port=<PORT_NUM>
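If you start the broker with the stock scripts, one way to pass these flags is through the KAFKA_JMX_OPTS environment variable, which kafka-run-class.sh picks up. The port 9999 and the extra flags below are illustrative; disable authentication and SSL only on trusted networks:

```shell
# Illustrative: expose JMX for the broker (no auth/SSL -- trusted networks only).
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.port=9999 \
  -Dcom.sun.management.jmxremote.rmi.port=9999"
bin/kafka-server-start.sh config/server.properties
```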

Producer and Consumer metrics don’t appear in my Datadog application

By default we only collect broker-based metrics.

If you’re running Java based Producers and Consumers, uncomment this section of the yaml file and point the Agent to the proper ports to start pulling in metrics:

# - host: remotehost
#   port: 9998 # Producer
#   tags:
#     kafka: producer0
#     env: stage
#     newTag: test
# - host: remotehost
#   port: 9997 # Consumer
#   tags:
#     kafka: consumer0
#     env: stage
#     newTag: test

If you are using custom Producer and Consumer clients that are not written in Java and/or do not expose mBeans, enabling this section will still collect zero metrics. To submit metrics from your code anyway, use DogStatsD.
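In practice you would use one of the official DogStatsD client libraries, but the wire format is simple enough to sketch by hand: a gauge is a UDP datagram of the form `metric:value|g`, optionally followed by `|#tag1,tag2`. The metric name and tags below are hypothetical:

```python
import socket

def dogstatsd_gauge(metric, value, tags=None):
    """Build a DogStatsD gauge datagram: <metric>:<value>|g[|#tag1,tag2]."""
    payload = f"{metric}:{value}|g"
    if tags:
        payload += "|#" + ",".join(tags)
    return payload

def send_metric(payload, host="127.0.0.1", port=8125):
    # The Agent's DogStatsD server listens on UDP 8125 by default;
    # sends are fire-and-forget.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))

# Hypothetical custom metric reported from a non-Java consumer.
datagram = dogstatsd_gauge("kafka.custom.consumer_lag", 42,
                           ["env:stage", "topic:orders"])
send_metric(datagram)
```

The official client libraries add batching, sampling, and other metric types on top of this same datagram format.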

Further Reading