---
title: Kafka Broker
description: Collect metrics for producers and consumers, replication, max lag, and more.
---

> For the complete documentation index, see [llms.txt](https://docs.datadoghq.com/llms.txt).

# Kafka Broker
**Integration version:** 4.6.0
{% callout %}
# Important note for users on the following Datadog sites: us2.ddog-gov.com

{% alert level="info" %}
To find out if this integration is available in your organization, see your [Datadog Integrations](https://app.datadoghq.com/integrations) page or ask your organization administrator.

To initiate an exception request to enable this integration for your organization, email [support@ddog-gov.com](mailto:support@ddog-gov.com).
{% /alert %}

{% /callout %}


**New:** [Kafka Monitoring](https://docs.datadoghq.com/data_streams/kafka.md?utm_source=docs&utm_medium=callout&utm_campaign=DocsCTA-DSMKafka-IntegrationsKafka) tracks consumer lag, throughput, and schemas, and adds message reading capabilities.

## Overview{% #overview %}

View Kafka broker metrics and logs for a 360-degree view of the health and performance of your Kafka clusters in real time.

**Note**:

- This check has a limit of 350 metrics per instance. The number of returned metrics is indicated in the Agent status output. Specify the metrics you are interested in by editing the configuration below. For more detailed instructions on customizing the metrics to collect, see the [JMX Checks documentation](https://docs.datadoghq.com/integrations/java.md).
- The sample configuration packaged with this integration works only for Kafka 0.8.2 and later. If you are running an older version, see the [Agent v5.2.x released sample files](https://raw.githubusercontent.com/DataDog/dd-agent/5.2.1/conf.d/kafka.yaml.example).
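
As a minimal sketch of narrowing collection to stay under the per-instance limit, the instance below turns off the packaged default metric list and collects a single bean attribute. The `conf`/`include` filter layout follows the JMX Checks documentation linked above. The host, port, and choice of bean are illustrative assumptions, and bean names vary with your Kafka version.

```yaml
init_config:
  is_jmx: true
  # Assumption: skip the packaged default metric list and
  # collect only the beans listed under `conf` below.
  collect_default_metrics: false

instances:
  - host: localhost   # assumed JMX host
    port: 9999        # assumed JMX port
    conf:
      - include:
          domain: kafka.server
          bean: kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
          attribute:
            Count:
              metric_type: rate
              alias: kafka.messages_in.rate
```

After restarting the Agent, the `metric_count` reported in the Agent status output should reflect the reduced set.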

To collect Kafka consumer metrics, see the [kafka_consumer check](https://docs.datadoghq.com/integrations/kafka-consumer.md).

**Minimum Agent version:** 6.0.0

## Setup{% #setup %}

### Installation{% #installation %}

The Agent's Kafka check is included in the [Datadog Agent](https://app.datadoghq.com/account/settings/agent/latest) package. No additional installation is needed on your Kafka nodes.

The check collects metrics from JMX with [JMXFetch](https://github.com/DataDog/jmxfetch). A JVM is needed on each Kafka node so the Agent can run JMXFetch. The Agent can use the same JVM that Kafka uses.

**Note**: The Kafka check cannot be used with Managed Streaming for Apache Kafka (Amazon MSK). Use the [Amazon MSK integration](https://docs.datadoghq.com/integrations/amazon_msk.md#pagetitle) instead.

### Configuration{% #configuration %}

{% tab title="Host" %}
#### Host{% #host %}

To configure this check for an Agent running on a host:

##### Metric collection{% #metric-collection %}

1. Edit the `kafka.d/conf.yaml` file in the `conf.d/` folder at the root of your [Agent's configuration directory](https://docs.datadoghq.com/agent/guide/agent-configuration-files.md#agent-configuration-directory). Kafka bean names depend on the exact Kafka version you're running. Use the [example configuration file](https://github.com/DataDog/integrations-core/blob/master/kafka/datadog_checks/kafka/data/conf.yaml.example) that comes packaged with the Agent as a base, since it is the most up-to-date configuration. **Note**: The example file may target a newer Agent version than the one you have installed.

1. [Restart the Agent](https://docs.datadoghq.com/agent/guide/agent-commands.md#start-stop-and-restart-the-agent).
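
For orientation, a minimal `kafka.d/conf.yaml` might look like the sketch below. It assumes the broker exposes JMX on `localhost:9999` (for example, by starting Kafka with the `JMX_PORT` environment variable set), which matches the instance name shown in the validation output later on this page. Treat the packaged example file as the authoritative list of options.

```yaml
init_config:
  is_jmx: true
  # Load the metric list that ships with the integration.
  collect_default_metrics: true

instances:
  - host: localhost   # assumed: the Agent runs on the broker host
    port: 9999        # assumed: the port the broker's JMX endpoint listens on
```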

##### Log collection{% #log-collection %}

*Available for Agent versions >6.0*

1. Kafka uses the `log4j` logger by default. To activate logging to a file and customize the format, edit the `log4j.properties` file:

   ```text
   # Set root logger level to INFO and its only appender to R
   log4j.rootLogger=INFO, R
   # Appender class for R (assumed here: a plain file appender)
   log4j.appender.R=org.apache.log4j.FileAppender
   log4j.appender.R.File=/var/log/kafka/server.log
   log4j.appender.R.layout=org.apache.log4j.PatternLayout
   log4j.appender.R.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
   ```

1. By default, the Datadog integration pipeline supports the following conversion patterns:

   ```text
     %d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
     %d [%t] %-5p %c - %m%n
     %r [%t] %p %c %x - %m%n
     [%d] %p %m (%c)%n
   ```

   Clone and edit the [integration pipeline](https://docs.datadoghq.com/logs/processing.md#integration-pipelines) if your logs use a different format.

1. Collecting logs is disabled by default in the Datadog Agent. Enable it in your `datadog.yaml` file:

   ```yaml
   logs_enabled: true
   ```

1. Add the following configuration block to your `kafka.d/conf.yaml` file. Change the `path` and `service` parameter values based on your environment. See the [sample kafka.d/conf.yaml](https://github.com/DataDog/integrations-core/blob/master/kafka/datadog_checks/kafka/data/conf.yaml.example) for all available configuration options.

   ```yaml
   logs:
     - type: file
       path: /var/log/kafka/server.log
       source: kafka
       service: myapp
       # To handle multi-line logs that start with yyyy-mm-dd, use the following pattern:
       #log_processing_rules:
       #  - type: multi_line
       #    name: log_start_with_date
       #    pattern: \d{4}\-(0?[1-9]|1[012])\-(0?[1-9]|[12][0-9]|3[01])
   ```

1. [Restart the Agent](https://docs.datadoghq.com/agent/guide/agent-commands.md#start-stop-and-restart-the-agent).
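
Kafka logs often contain Java stack traces that span several lines. As a sketch, this is the same log configuration with the commented-out multi-line rule enabled, so that each entry starting with a `yyyy-mm-dd` date is treated as a new log event. The `path` and `service` values are the placeholders from the step above and should be adjusted to your environment.

```yaml
logs:
  - type: file
    path: /var/log/kafka/server.log
    source: kafka
    service: myapp
    log_processing_rules:
      # Lines that do not start with a date are appended to the previous entry.
      - type: multi_line
        name: log_start_with_date
        pattern: \d{4}\-(0?[1-9]|1[012])\-(0?[1-9]|[12][0-9]|3[01])
```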

{% /tab %}

{% tab title="Containerized" %}
#### Containerized{% #containerized %}

##### Metric collection{% #metric-collection %}

For containerized environments, see the [Autodiscovery with JMX](https://docs.datadoghq.com/agent/guide/autodiscovery-with-jmx.md?tab=containerizedagent) guide.
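
As an illustrative sketch only, a Kubernetes pod annotation in the style described in that guide could look like the following. The container name `kafka`, the JMX port `9999`, and the `<KAFKA_IMAGE>` placeholder are assumptions. The annotation syntax and the requirement for a JMX-enabled Agent image should be verified against the guide for your Agent version.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kafka
  annotations:
    # "kafka" in the annotation key is the assumed container name.
    ad.datadoghq.com/kafka.checks: |
      {
        "kafka": {
          "init_config": {"is_jmx": true, "collect_default_metrics": true},
          "instances": [{"host": "%%host%%", "port": "9999"}]
        }
      }
spec:
  containers:
    - name: kafka            # must match the name used in the annotation key
      image: <KAFKA_IMAGE>   # placeholder
```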

##### Log collection{% #log-collection %}

*Available for Agent versions >6.0*

Collecting logs is disabled by default in the Datadog Agent. To enable it, see [Kubernetes Log Collection](https://docs.datadoghq.com/agent/kubernetes/log.md).

| Parameter      | Value                                              |
| -------------- | -------------------------------------------------- |
| `<LOG_CONFIG>` | `{"source": "kafka", "service": "<SERVICE_NAME>"}` |
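
On Kubernetes, the `<LOG_CONFIG>` value is typically supplied through a pod annotation. The fragment below is a sketch that assumes the Kafka container is named `kafka`. The value is wrapped in a JSON array, as described in the Kubernetes Log Collection documentation.

```yaml
metadata:
  annotations:
    # "kafka" in the annotation key is the assumed container name.
    ad.datadoghq.com/kafka.logs: '[{"source": "kafka", "service": "<SERVICE_NAME>"}]'
```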

{% /tab %}

### Validation{% #validation %}

[Run the Agent's status subcommand](https://docs.datadoghq.com/agent/guide/agent-commands.md#agent-status-and-information) and look for `kafka` under the **JMXFetch** section:

```text
========
JMXFetch
========
  Initialized checks
  ==================
    kafka
      instance_name : kafka-localhost-9999
      message :
      metric_count : 46
      service_check_count : 0
      status : OK
```

## Data Collected{% #data-collected %}

### Metrics{% #metrics %}

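| Metric                                                                        | Description                                                                                                      |
| ----------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- |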
| **kafka.broker.start\_time**(gauge)                                           | The start time of the Kafka broker in milliseconds since epoch.*Shown as millisecond*                            |
| **kafka.consumer.bytes\_consumed**(gauge)                                     | The average number of bytes consumed per second for a specific topic.*Shown as byte*                             |
| **kafka.consumer.bytes\_in**(gauge)                                           | Consumer bytes in rate.*Shown as byte*                                                                           |
| **kafka.consumer.delayed\_requests**(gauge)                                   | Number of delayed consumer requests.*Shown as request*                                                           |
| **kafka.consumer.expires\_per\_second**(gauge)                                | Rate of delayed consumer request expiration.*Shown as eviction*                                                  |
| **kafka.consumer.fetch\_rate**(gauge)                                         | The minimum rate at which the consumer sends fetch requests to a broker.*Shown as request*                       |
| **kafka.consumer.fetch\_size\_avg**(gauge)                                    | The average number of bytes fetched per request for a specific topic.*Shown as byte*                             |
| **kafka.consumer.fetch\_size\_max**(gauge)                                    | The maximum number of bytes fetched per request for a specific topic.*Shown as byte*                             |
| **kafka.consumer.kafka\_commits**(gauge)                                      | Rate of offset commits to Kafka.*Shown as write*                                                                 |
| **kafka.consumer.max\_lag**(gauge)                                            | Maximum consumer lag.*Shown as offset*                                                                           |
| **kafka.consumer.messages\_in**(gauge)                                        | Rate of consumer message consumption.*Shown as message*                                                          |
| **kafka.consumer.records\_consumed**(gauge)                                   | The average number of records consumed per second for a specific topic.*Shown as record*                         |
| **kafka.consumer.records\_per\_request\_avg**(gauge)                          | The average number of records in each request for a specific topic.*Shown as record*                             |
| **kafka.consumer.zookeeper\_commits**(gauge)                                  | Rate of offset commits to ZooKeeper.*Shown as write*                                                             |
| **kafka.controller.active\_broker\_count**(gauge)                             | Number of active brokers in the cluster.*Shown as node*                                                          |
| **kafka.controller.fenced\_broker\_count**(gauge)                             | Number of fenced brokers in the cluster.*Shown as node*                                                          |
| **kafka.expires\_sec**(gauge)                                                 | Rate of delayed producer request expiration.*Shown as eviction*                                                  |
| **kafka.follower.expires\_per\_second**(gauge)                                | Rate of request expiration on followers.*Shown as eviction*                                                      |
| **kafka.kraft.append\_records\_rate**(gauge)                                  | Records appended per second by the Raft leader.*Shown as record*                                                 |
| **kafka.kraft.broker\_metadata.last\_applied\_record\_lag\_ms**(gauge)        | Lag in milliseconds of the last applied metadata record.*Shown as millisecond*                                   |
| **kafka.kraft.broker\_metadata.last\_applied\_record\_offset**(gauge)         | Offset of the last applied metadata record.*Shown as offset*                                                     |
| **kafka.kraft.broker\_metadata.last\_applied\_record\_timestamp**(gauge)      | Timestamp of the last applied metadata record.*Shown as millisecond*                                             |
| **kafka.kraft.broker\_metadata.metadata\_apply\_error\_count**(rate)          | Number of errors applying the metadata log.*Shown as error*                                                      |
| **kafka.kraft.broker\_metadata.metadata\_load\_error\_count**(rate)           | Number of errors loading the metadata log.*Shown as error*                                                       |
| **kafka.kraft.commit\_latency\_avg**(gauge)                                   | Average time to commit a Raft entry.*Shown as millisecond*                                                       |
| **kafka.kraft.commit\_latency\_max**(gauge)                                   | Maximum time to commit a Raft entry.*Shown as millisecond*                                                       |
| **kafka.kraft.current\_epoch**(gauge)                                         | Current quorum epoch.                                                                                            |
| **kafka.kraft.current\_leader**(gauge)                                        | Current quorum leader ID; -1 if unknown.                                                                         |
| **kafka.kraft.current\_vote**(gauge)                                          | Current voted leader ID; -1 if not voted.                                                                        |
| **kafka.kraft.election\_latency\_avg**(gauge)                                 | Average time for a leader election.*Shown as millisecond*                                                        |
| **kafka.kraft.election\_latency\_max**(gauge)                                 | Maximum time for a leader election.*Shown as millisecond*                                                        |
| **kafka.kraft.fetch\_records\_rate**(gauge)                                   | Records fetched per second from the Raft leader.*Shown as record*                                                |
| **kafka.kraft.high\_watermark**(gauge)                                        | High watermark of the Raft quorum; -1 if unknown.*Shown as offset*                                               |
| **kafka.kraft.log\_end\_epoch**(gauge)                                        | Log end epoch of the Raft quorum.                                                                                |
| **kafka.kraft.log\_end\_offset**(gauge)                                       | Log end offset of the Raft quorum.*Shown as offset*                                                              |
| **kafka.kraft.metadata\_loader.current\_controller\_id**(gauge)               | ID of the current controller; -1 if unknown.                                                                     |
| **kafka.kraft.metadata\_loader.current\_metadata\_version**(gauge)            | Feature level of the current effective metadata version.                                                         |
| **kafka.kraft.metadata\_loader.handle\_load\_snapshot\_count**(rate)          | Number of metadata snapshots loaded since process start.                                                         |
| **kafka.kraft.poll\_idle\_ratio\_avg**(gauge)                                 | Average fraction of time the poll loop is idle.*Shown as fraction*                                               |
| **kafka.kraft.snapshot\_emitter.latest\_snapshot\_generated\_age\_ms**(gauge) | Age in milliseconds of the latest generated snapshot.*Shown as millisecond*                                      |
| **kafka.kraft.snapshot\_emitter.latest\_snapshot\_generated\_bytes**(gauge)   | Size in bytes of the latest generated snapshot.*Shown as byte*                                                   |
| **kafka.kraft.unknown\_voter\_connections**(gauge)                            | Number of unknown voter connections whose information is not cached.*Shown as connection*                        |
| **kafka.log.directory.offline**(gauge)                                        | Whether a Kafka log directory is offline. 0 means healthy.                                                       |
| **kafka.log.flush\_rate.rate**(gauge)                                         | Log flush rate.*Shown as flush*                                                                                  |
| **kafka.log.partition.size**(gauge)                                           | The size in bytes of a topic partition log on disk.*Shown as byte*                                               |
| **kafka.messages\_in.rate**(gauge)                                            | Incoming message rate.*Shown as message*                                                                         |
| **kafka.net.bytes\_in.rate**(gauge)                                           | Incoming byte rate.*Shown as byte*                                                                               |
| **kafka.net.bytes\_out**(gauge)                                               | Outgoing byte total.*Shown as byte*                                                                              |
| **kafka.net.bytes\_out.rate**(gauge)                                          | Outgoing byte rate.*Shown as byte*                                                                               |
| **kafka.net.bytes\_rejected.rate**(gauge)                                     | Rejected byte rate.*Shown as byte*                                                                               |
| **kafka.net.processor.avg.idle.pct.rate**(gauge)                              | Average fraction of time the network processor threads are idle.*Shown as fraction*                              |
| **kafka.producer.available\_buffer\_bytes**(gauge)                            | The total amount of buffer memory that is not being used (either unallocated or in the free list).*Shown as byte* |
| **kafka.producer.batch\_size\_avg**(gauge)                                    | The average number of bytes sent per partition per-request.*Shown as byte*                                       |
| **kafka.producer.batch\_size\_max**(gauge)                                    | The max number of bytes sent per partition per-request.*Shown as byte*                                           |
| **kafka.producer.buffer\_bytes\_total**(gauge)                                | The maximum amount of buffer memory the client can use (whether or not it is currently used).*Shown as byte*     |
| **kafka.producer.bufferpool\_wait\_ratio**(gauge)                             | The fraction of time an appender waits for space allocation.                                                     |
| **kafka.producer.bufferpool\_wait\_time**(gauge)                              | The fraction of time an appender waits for space allocation.                                                     |
| **kafka.producer.bufferpool\_wait\_time\_ns\_total**(gauge)                   | The total time in nanoseconds an appender waits for space allocation.*Shown as nanosecond*                       |
| **kafka.producer.bytes\_out**(gauge)                                          | Producer bytes out rate.*Shown as byte*                                                                          |
| **kafka.producer.compression\_rate**(gauge)                                   | The average compression rate of record batches for a topic.*Shown as fraction*                                   |
| **kafka.producer.compression\_rate\_avg**(rate)                               | The average compression rate of record batches.*Shown as fraction*                                               |
| **kafka.producer.delayed\_requests**(gauge)                                   | Number of producer requests delayed.*Shown as request*                                                           |
| **kafka.producer.expires\_per\_seconds**(gauge)                               | Rate of producer request expiration.*Shown as eviction*                                                          |
| **kafka.producer.io\_wait**(gauge)                                            | Producer I/O wait time.*Shown as nanosecond*                                                                     |
| **kafka.producer.message\_rate**(gauge)                                       | Producer message rate.*Shown as message*                                                                         |
| **kafka.producer.metadata\_age**(gauge)                                       | The age in seconds of the current producer metadata being used.*Shown as second*                                 |
| **kafka.producer.record\_error\_rate**(gauge)                                 | The average per-second number of errored record sends for a topic.*Shown as error*                               |
| **kafka.producer.record\_queue\_time\_avg**(gauge)                            | The average time in ms record batches spent in the record accumulator.*Shown as millisecond*                     |
| **kafka.producer.record\_queue\_time\_max**(gauge)                            | The maximum time in ms record batches spent in the record accumulator.*Shown as millisecond*                     |
| **kafka.producer.record\_retry\_rate**(gauge)                                 | The average per-second number of retried record sends for a topic.*Shown as record*                              |
| **kafka.producer.record\_send\_rate**(gauge)                                  | The average number of records sent per second for a topic.*Shown as record*                                      |
| **kafka.producer.record\_size\_avg**(gauge)                                   | The average record size.*Shown as byte*                                                                          |
| **kafka.producer.record\_size\_max**(gauge)                                   | The maximum record size.*Shown as byte*                                                                          |
| **kafka.producer.records\_per\_request**(gauge)                               | The average number of records sent per request.*Shown as record*                                                 |
| **kafka.producer.request\_latency\_avg**(gauge)                               | Producer average request latency.*Shown as millisecond*                                                          |
| **kafka.producer.request\_latency\_max**(gauge)                               | The maximum request latency in ms.*Shown as millisecond*                                                         |
| **kafka.producer.request\_rate**(gauge)                                       | Number of producer requests per second.*Shown as request*                                                        |
| **kafka.producer.requests\_in\_flight**(gauge)                                | The current number of in-flight requests awaiting a response.*Shown as request*                                  |
| **kafka.producer.response\_rate**(gauge)                                      | Number of producer responses per second.*Shown as response*                                                      |
| **kafka.producer.throttle\_time\_avg**(gauge)                                 | The average time in ms a request was throttled by a broker.*Shown as millisecond*                                |
| **kafka.producer.throttle\_time\_max**(gauge)                                 | The maximum time in ms a request was throttled by a broker.*Shown as millisecond*                                |
| **kafka.producer.waiting\_threads**(gauge)                                    | The number of user threads blocked waiting for buffer memory to enqueue their records.*Shown as thread*          |
| **kafka.replication.active\_controller\_count**(gauge)                        | Number of active controllers in the cluster.*Shown as node*                                                      |
| **kafka.replication.isr\_expands.rate**(gauge)                                | Rate of replicas joining the ISR pool.*Shown as node*                                                            |
| **kafka.replication.isr\_shrinks.rate**(gauge)                                | Rate of replicas leaving the ISR pool.*Shown as node*                                                            |
| **kafka.replication.leader\_count**(gauge)                                    | Number of leaders on this broker.*Shown as node*                                                                 |
| **kafka.replication.leader\_elections.rate**(gauge)                           | Leader election rate.*Shown as event*                                                                            |
| **kafka.replication.max\_lag**(gauge)                                         | Maximum lag in messages between the follower and leader replicas.*Shown as offset*                               |
| **kafka.replication.offline\_partitions\_count**(gauge)                       | Number of partitions that don't have an active leader.                                                           |
| **kafka.replication.partition\_count**(gauge)                                 | Number of partitions across all topics in the cluster.                                                           |
| **kafka.replication.unclean\_leader\_elections.rate**(gauge)                  | Unclean leader election rate.*Shown as event*                                                                    |
| **kafka.replication.under\_min\_isr\_partition\_count**(gauge)                | Number of under min ISR partitions.                                                                              |
| **kafka.replication.under\_replicated\_partitions**(gauge)                    | Number of under replicated partitions.                                                                           |
| **kafka.request.channel.queue.size**(gauge)                                   | Number of queued requests.*Shown as request*                                                                     |
| **kafka.request.fetch.failed.rate**(gauge)                                    | Client fetch request failures rate.*Shown as request*                                                            |
| **kafka.request.fetch\_consumer.rate**(gauge)                                 | Fetch consumer requests rate.*Shown as request*                                                                  |
| **kafka.request.fetch\_consumer.time.99percentile**(gauge)                    | Total time in ms to serve the specified request.*Shown as millisecond*                                           |
| **kafka.request.fetch\_consumer.time.avg**(gauge)                             | Total time in ms to serve the specified request.*Shown as millisecond*                                           |
| **kafka.request.fetch\_follower.rate**(gauge)                                 | Fetch follower requests rate.*Shown as request*                                                                  |
| **kafka.request.fetch\_follower.time.99percentile**(gauge)                    | Total time in ms to serve the specified request.*Shown as millisecond*                                           |
| **kafka.request.fetch\_follower.time.avg**(gauge)                             | Total time in ms to serve the specified request.*Shown as millisecond*                                           |
| **kafka.request.fetch\_request\_purgatory.size**(gauge)                       | Number of requests waiting in the fetch purgatory.*Shown as request*                                             |
| **kafka.request.handler.avg.idle.pct.rate**(gauge)                            | Average fraction of time the request handler threads are idle.*Shown as fraction*                                |
| **kafka.request.metadata.time.99percentile**(gauge)                           | Time for metadata requests for 99th percentile.*Shown as millisecond*                                            |
| **kafka.request.metadata.time.avg**(gauge)                                    | Average time for metadata request.*Shown as millisecond*                                                         |
| **kafka.request.offsets.time.99percentile**(gauge)                            | Time for offset requests for 99th percentile.*Shown as millisecond*                                              |
| **kafka.request.offsets.time.avg**(gauge)                                     | Average time for an offset request.*Shown as millisecond*                                                        |
| **kafka.request.produce.failed.rate**(gauge)                                  | Failed produce requests rate.*Shown as request*                                                                  |
| **kafka.request.produce.rate**(gauge)                                         | Produce requests rate.*Shown as request*                                                                         |
| **kafka.request.produce.time.99percentile**(gauge)                            | Time for produce requests for 99th percentile.*Shown as millisecond*                                             |
| **kafka.request.produce.time.avg**(gauge)                                     | Average time for a produce request.*Shown as millisecond*                                                        |
| **kafka.request.producer\_request\_purgatory.size**(gauge)                    | Number of requests waiting in the producer purgatory.*Shown as request*                                          |
| **kafka.request.update\_metadata.time.99percentile**(gauge)                   | Time for update metadata requests for 99th percentile.*Shown as millisecond*                                     |
| **kafka.request.update\_metadata.time.avg**(gauge)                            | Average time for a request to update metadata.*Shown as millisecond*                                             |
| **kafka.server.socket.connection\_count**(gauge)                              | Number of currently open connections to the broker.*Shown as connection*                                         |
| **kafka.session.fetch.count**(gauge)                                          | Number of fetch sessions.                                                                                        |
| **kafka.session.fetch.eviction**(gauge)                                       | Eviction rate of fetch session.*Shown as event*                                                                  |
| **kafka.session.zookeeper.disconnect.rate**(gauge)                            | ZooKeeper client disconnect rate.*Shown as event*                                                                |
| **kafka.session.zookeeper.expire.rate**(gauge)                                | ZooKeeper client session expiration rate.*Shown as event*                                                        |
| **kafka.session.zookeeper.readonly.rate**(gauge)                              | ZooKeeper client readonly rate.*Shown as event*                                                                  |
| **kafka.session.zookeeper.sync.rate**(gauge)                                  | ZooKeeper client sync rate.*Shown as event*                                                                      |
| **kafka.topic.messages\_in.rate**(gauge)                                      | Incoming message rate by topic.*Shown as message*                                                                |
| **kafka.topic.net.bytes\_in.rate**(gauge)                                     | Incoming byte rate by topic.*Shown as byte*                                                                      |
| **kafka.topic.net.bytes\_out.rate**(gauge)                                    | Outgoing byte rate by topic.*Shown as byte*                                                                      |
| **kafka.topic.net.bytes\_rejected.rate**(gauge)                               | Rejected byte rate by topic.*Shown as byte*                                                                      |

### Events{% #events %}

The Kafka check does not include any events.

### Service Checks{% #service-checks %}

**kafka.can\_connect**

Returns `CRITICAL` if the Agent is unable to connect to and collect metrics from the monitored Kafka instance, `WARNING` if no metrics are collected, and `OK` otherwise.

*Statuses: ok, critical, warning*

## Troubleshooting{% #troubleshooting %}

- [Troubleshooting and Deep Dive for Kafka](https://docs.datadoghq.com/integrations/faq/troubleshooting-and-deep-dive-for-kafka.md)
- [Agent failed to retrieve RMIServer stub](https://docs.datadoghq.com/integrations/guide/agent-failed-to-retrieve-rmiserver-stub.md)

## Further Reading{% #further-reading %}

- [Monitoring Kafka performance metrics](https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics)
- [Collecting Kafka performance metrics](https://www.datadoghq.com/blog/collecting-kafka-performance-metrics)
- [Monitoring Kafka with Datadog](https://www.datadoghq.com/blog/monitor-kafka-with-datadog)
- [What is Apache Kafka? How it Works & Use Cases](https://www.datadoghq.com/knowledge-center/apache-kafka/)