Monitors and Automation

After your Kafka clusters are connected to Data Streams Monitoring (see Kafka Monitoring Setup), the next step is to alert on the conditions that put your pipelines at risk and, where possible, automate the response.

Data Streams Monitoring ships with monitor templates that you can create directly from a cluster or topic detail page.

Cluster-level templates

| Template | Description | Metric | Condition |
|---|---|---|---|
| Offline partitions detected | Topic data is unavailable for both reads and writes, risking message loss, consumer lag, and service outages until leadership is reassigned | kafka.partition.offline | Any partition in the cluster is offline |
| Under-replicated partitions detected | Topic data has fewer in-sync replicas than configured, increasing risk of data loss if the leader broker fails before replication catches up | kafka.partition.under_replicated | Any partition in the cluster is under-replicated |

Both monitors are grouped by kafka_cluster_id, so each cluster triggers its own alert and the notification can be routed to that cluster's owner.

Topic-level templates

| Template | Description | Metric | Condition |
|---|---|---|---|
| Consumer lag is high for topic | Measured in seconds, indicating stale data served to customers, message backlog buildup, and delayed downstream processing | kafka.estimated_consumer_lag | Consumer lag in seconds exceeds a threshold for a topic and consumer group |
| Incoming message rate has dropped | Catches silent producer failures | kafka.topic.message_rate | Produce rate to the topic drops below a threshold |
| Offline partitions on topic | Topic data is unavailable for both reads and writes, risking message loss, consumer lag, and service outages until leadership is reassigned | kafka.partition.offline | Any partition for this specific topic goes offline |
| Consumer lag is approaching time retention limit | Increased risk of data loss. Beyond the retention limit, the consumer cannot recover lost data | kafka.estimated_consumer_lag / kafka.topic.config.retention_ms | Estimated lag approaches the topic’s time-based retention |
| Consumer lag is approaching bytes retention limit | Increased risk of data loss. Beyond the retention limit, the consumer cannot recover lost data | kafka.consumer_lag × throughput / kafka.topic.config.retention_bytes | Estimated lag approaches the topic’s bytes-based retention |

Requires Kafka broker metrics to be available
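The time-based retention condition in the table can be illustrated with a small calculation. The sketch below is not the monitor's actual query. It assumes that kafka.estimated_consumer_lag is reported in seconds (as the table states) and that kafka.topic.config.retention_ms is in milliseconds (as its name suggests). The function names and the 0.8 threshold are hypothetical, and treating a non-positive retention value as "no time limit" mirrors Kafka's retention.ms=-1 setting rather than any documented behavior of the metric.

```python
def time_retention_ratio(lag_seconds: float, retention_ms: float) -> float:
    """Return the fraction of the time-based retention window consumed by lag.

    Assumption: lag is in seconds and retention is in milliseconds.
    A non-positive retention is treated as unlimited, so the ratio is 0.0.
    """
    if lag_seconds < 0:
        raise ValueError("lag_seconds must be non-negative")
    if retention_ms <= 0:
        return 0.0
    # Convert lag to milliseconds so both quantities share a unit.
    return (lag_seconds * 1000.0) / retention_ms


def is_approaching_retention(lag_seconds: float, retention_ms: float,
                             threshold: float = 0.8) -> bool:
    """Hypothetical check: True when lag has used `threshold` of the window."""
    return time_retention_ratio(lag_seconds, retention_ms) >= threshold


if __name__ == "__main__":
    seven_days_ms = 7 * 24 * 3600 * 1000
    # One hour of lag against a seven-day retention window.
    print(time_retention_ratio(3600, seven_days_ms))
    # Six days of lag against the same window.
    print(is_approaching_retention(6 * 24 * 3600, seven_days_ms))
```

A ratio close to 1.0 means unread messages are about to age out of the topic, which is the situation the template is meant to catch early.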

Automate responses to triggered monitors

When a monitor triggers, Datadog can take action automatically rather than waiting for a human to triage. Two options:

  • Workflow Automation — Build a Datadog Workflow that chains pre-built actions across your infrastructure and tools (PagerDuty, Slack, Jira, AWS, Kubernetes, and so on), and run it from a monitor trigger. Best for the “trigger a runbook” patterns below. See Trigger a workflow from a monitor.
  • Webhooks — Call any HTTP endpoint when a monitor triggers, recovers, or changes state. Best when the action lives in a system outside Datadog and you already have an HTTPS callback. See Webhooks integration.

Either option can be added to a monitor by mentioning it in the notification message: @workflow-<name> for Workflow Automation, @webhook-<name> for a webhook. Monitor metadata is available as template variables ({{topic.name}}, {{kafka_cluster_id.name}}, {{value}}, etc.) and can be passed to the workflow or webhook payload.
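As an illustration of the mention syntax, the sketch below assembles a notification message that includes the template variables listed above and ends with a workflow or webhook mention. The helper name build_monitor_message and the handle name scale-consumers are hypothetical, and the message wording is only an example. Use the handle of a workflow or webhook that exists in your own account.

```python
def build_monitor_message(kind: str, name: str) -> str:
    """Build a monitor notification message ending with an automation mention.

    `kind` is either "workflow" or "webhook"; `name` is the handle of an
    existing workflow or webhook (hypothetical in this example).
    """
    if kind not in ("workflow", "webhook"):
        raise ValueError("kind must be 'workflow' or 'webhook'")
    mention = f"@{kind}-{name}"
    # The double-brace placeholders are left as-is: Datadog fills them in
    # when the monitor triggers, not this script.
    return (
        "Consumer lag is {{value}} on topic {{topic.name}} "
        "in cluster {{kafka_cluster_id.name}}. "
        + mention
    )


if __name__ == "__main__":
    print(build_monitor_message("workflow", "scale-consumers"))
```

The resulting text can be pasted into the monitor's notification message field or supplied as the message when creating the monitor programmatically.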

The following examples show conditions where automation is particularly valuable in a Kafka pipeline.

Consumer lag is high

Signals that a consumer group is falling behind its producer, with messages accumulating in the topic faster than they can be read.

Potential action: Run a workflow that scales the consumer group’s replica count (for example, with the Kubernetes or AWS actions in Workflow Automation), or call a CI/CD or autoscaler webhook.
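A scaling workflow or autoscaler webhook needs some rule for choosing the new replica count. The sketch below shows one possible heuristic, not a Datadog feature: it scales the replica count in proportion to how far lag exceeds a target, never scales down, and caps the result at the topic's partition count, because consumers in a group beyond the number of partitions sit idle. The function name, parameters, and target value are all hypothetical.

```python
import math


def desired_consumer_replicas(current_replicas: int, lag_seconds: float,
                              target_lag_seconds: float,
                              partition_count: int) -> int:
    """Suggest a replica count for a lagging consumer group (heuristic)."""
    if current_replicas < 1 or partition_count < 1:
        raise ValueError("replica and partition counts must be at least 1")
    if target_lag_seconds <= 0:
        raise ValueError("target_lag_seconds must be positive")
    # Proportional scaling: lag at twice the target asks for twice the replicas.
    scaled = math.ceil(current_replicas * lag_seconds / target_lag_seconds)
    # Cap at the partition count, but never go below the current count.
    return max(current_replicas, min(scaled, partition_count))


if __name__ == "__main__":
    # 3 replicas, 600 s of lag against a 120 s target, 12 partitions.
    print(desired_consumer_replicas(3, 600, 120, 12))
```

Proportional scaling is deliberately simple. A production rule would likely also account for cooldown periods and rebalance cost.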

Lag approaching retention limit

Signals that unread messages are approaching the topic’s retention window. If lag exceeds retention, those messages get deleted before the consumer can read them.

Potential action: Trigger an emergency runbook that can temporarily extend retention on the affected topic, pause the upstream producer, or scale the consumer group ahead of the threshold.
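If the runbook extends retention, it needs a new retention value that comfortably exceeds the current lag. The following is a minimal sketch of that arithmetic only. The function name and the headroom factor of 2.0 are hypothetical, lag is assumed to be in seconds and retention in milliseconds, and applying the value to the topic (for example, with your Kafka admin tooling) is outside the scope of the example.

```python
import math


def extended_retention_ms(current_retention_ms: int, lag_seconds: float,
                          headroom: float = 2.0) -> int:
    """Return a retention value (ms) that leaves `headroom` times the lag.

    The result is never lower than the current retention, so the runbook
    only ever extends the window.
    """
    if headroom < 1.0:
        raise ValueError("headroom must be at least 1.0")
    if lag_seconds < 0:
        raise ValueError("lag_seconds must be non-negative")
    lag_ms = lag_seconds * 1000.0
    return max(current_retention_ms, math.ceil(lag_ms * headroom))


if __name__ == "__main__":
    # One hour of retention, 50 minutes of lag.
    print(extended_retention_ms(3_600_000, 3000))
```

Because the extension is temporary, the runbook should also record the original value so that retention can be restored after the consumer catches up.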

Broker disk filling up

Signals that a broker host is running low on disk space. If the disk fills up, the broker goes offline and its partitions become unavailable.

Potential action: Trigger a capacity workflow to add storage, expand the cluster, or reduce retention on a candidate topic.

Offline or under-replicated partitions

Signals that one or more partitions in the cluster are offline (unavailable) or under-replicated, which puts data durability at risk if a broker fails.

Potential action: Trigger a broker-health workflow — for example, restart a stuck broker or rebalance partitions.
