This product is not supported for your selected
Datadog site. (
).
After your Kafka clusters are connected to Data Streams Monitoring (see Kafka Monitoring Setup), the next step is to alert on the conditions that put your pipelines at risk and, where possible, automate the response.
This page covers:
Recommended monitors
Data Streams Monitoring ships with monitor templates that you can create directly from a cluster or topic detail page.
Cluster-level templates
| Template | Description | Metric | Condition |
|---|
| Offline partitions detected | Topic data is unavailable for both reads and writes, risking message loss, consumer lag, and service outages until leadership is reassigned | kafka.partition.offline | Any partition in the cluster is offline |
| Under-replicated partitions detected | Topic data has fewer in-sync replicas than configured, increasing risk of data loss if the leader broker fails before replication catches up | kafka.partition.under_replicated | Any partition in the cluster is under-replicated |
Both monitors are grouped by kafka_cluster_id so each cluster alerts its own owner.
Topic-level templates
| Template | Description | Metric | Condition |
|---|
| Consumer lag is high for topic | Measured in seconds, indicating stale data served to customers, message backlog buildup, and delayed downstream processing | kafka.estimated_consumer_lag | Consumer lag in seconds exceeds a threshold for a topic and consumer group |
| Incoming message rate has dropped | Catches silent producer failures | kafka.topic.message_rate | Produce rate to the topic drops below a threshold |
| Offline partitions on topic | Topic data is unavailable for both reads and writes, risking message loss, consumer lag, and service outages until leadership is reassigned | kafka.partition.offline | Any partition for this specific topic goes offline |
| Consumer lag is approaching time retention limit | Increased risk of data loss. Beyond the retention limit, the consumer cannot recover lost data | kafka.estimated_consumer_lag / kafka.topic.config.retention_ms | Estimated lag approaches the topic’s time-based retention |
| Consumer lag is approaching bytes retention limit | Increased risk of data loss. Beyond the retention limit, the consumer cannot recover lost data | kafka.consumer_lag × throughput / kafka.topic.config.retention_bytes | Estimated lag approaches the topic’s bytes-based retention.
Requires Kafka broker metrics to be available |
Automate responses to triggered monitors
When a monitor triggers, Datadog can take action automatically rather than waiting for a human to triage. Two options:
- Workflow Automation — Build a Datadog Workflow that chains pre-built actions across your infrastructure and tools (PagerDuty, Slack, Jira, AWS, Kubernetes, and so on), and run it from a monitor trigger. Best for the “trigger a runbook” patterns below. See Trigger a workflow from a monitor.
- Webhooks — Call any HTTP endpoint when a monitor triggers, recovers, or changes state. Best when the action lives in a system outside Datadog and you already have an HTTPS callback. See Webhooks integration.
Either option can be added to a monitor by mentioning it in the notification message: @workflow-<name> for Workflow Automation, @webhook-<name> for a webhook. Monitor metadata is available as template variables ({{topic.name}}, {{kafka_cluster_id.name}}, {{value}}, etc.) and can be passed to the workflow or webhook payload.
The following examples show conditions where automation is particularly valuable in a Kafka pipeline.
Consumer lag is high
Signals that a consumer group is falling behind its producer, with messages accumulating in the topic faster than they can be read.
Potential action: Run a workflow that scales the consumer group’s replica count (for example, with the Kubernetes or AWS actions in Workflow Automation), or call a CI/CD or autoscaler webhook.
Lag approaching retention limit
Signals that unread messages are approaching the topic’s retention window. If lag exceeds retention, those messages get deleted before the consumer can read them.
Potential action: Trigger an emergency runbook that can temporarily extend retention on the affected topic, pause the upstream producer, or scale the consumer group ahead of the threshold.
Broker disk filling up
Signals that a broker host is running low on disk space. If the disk fills up, the broker goes offline and its partitions become unavailable.
Potential action: Trigger a capacity workflow to add storage, expand the cluster, or reduce retention on a candidate topic.
Offline or under-replicated partitions
Signals that one or more partitions in the cluster are offline (unavailable) or under-replicated, which puts data durability at risk if a broker fails.
Potential action: Trigger a broker-health workflow — for example, restart a stuck broker or rebalance partitions.
Further reading
Additional helpful documentation, links, and articles: