Confluent Cloud

Integration version 1.0.0

Overview

Confluent Cloud is a fully managed, cloud-hosted streaming data service. Connect Datadog with Confluent Cloud to visualize and alert on key metrics for your Confluent Cloud resources.

Datadog’s out-of-the-box Confluent Cloud dashboard shows you key cluster metrics for monitoring the health and performance of your environment, including information such as the rate of change in active connections and your ratio of average consumed to produced records.

You can use recommended monitors to notify and alert your team when topic lag is getting too high, or use these metrics to create your own.
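
As a starting point for a custom monitor, the sketch below builds a metric monitor definition for consumer lag as a plain Python dictionary. The metric name and the `consumer_group_id` and `topic` tags come from the metrics list in this document. The threshold of 1000 offsets, the five-minute window, the monitor name and message, and the `build_lag_monitor` helper are illustrative assumptions, not recommended values.

```python
import json


def build_lag_monitor(threshold: int = 1000, window: str = "last_5m") -> dict:
    """Return a hypothetical metric-monitor definition for consumer lag."""
    metric = "confluent_cloud.kafka.consumer_lag_offsets"
    # Alert when the highest lag seen in the window exceeds the threshold,
    # evaluated separately for each consumer group and topic.
    query = (
        f"max({window}):max:{metric}{{*}} "
        f"by {{consumer_group_id,topic}} > {threshold}"
    )
    return {
        "name": "Confluent Cloud consumer lag is high",  # assumed name
        "type": "metric alert",
        "query": query,
        "message": "Consumer lag exceeded the threshold.",  # assumed text
        "options": {"thresholds": {"critical": threshold}},
    }


monitor = build_lag_monitor()
print(json.dumps(monitor, indent=2))
```

The dictionary is only built and printed here. Submitting it to Datadog, and choosing a threshold that fits your consumers, is left to your own tooling.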

Setup

Installation

Install the integration with the Datadog Confluent Cloud integration tile.

Configuration

1. In Confluent Cloud, click + Add API Key to enter your Confluent Cloud API Key and API Secret.
   - Create a Cloud Resource Management API key and secret.
   - In the Datadog integration configuration, add the API key and secret to the API Key and API Secret fields.
   - Click Save. Datadog searches for accounts associated with those credentials.
2. Add your Confluent Cloud Cluster ID or Connector ID. Datadog crawls the Confluent Cloud metrics and loads them within minutes.
3. To collect your tags defined in Confluent Cloud (optional):
   - Create a Schema Registry API key and secret. Read more about Schema Management on Confluent Cloud.
   - In the Datadog integration configuration, add the API key and secret to the Schema Registry API Key and Secret fields.
   - Click Save. Datadog collects tags defined in Confluent Cloud.
4. If you use Cloud Cost Management and enable collecting cost data:
   - Ensure that the API key has the BillingAdmin role enabled.
   - Collected cost data is visible in Cloud Cost Management within 24 hours.

For more information about configuration resources, such as Clusters and Connectors, refer to the Confluent Cloud Integration documentation.

API Key and secret

To create your Confluent Cloud API Key and Secret, see Add the MetricsViewer role to a new service account in the UI.

Cluster ID

To find your Confluent Cloud Cluster ID:

1. In Confluent Cloud, navigate to Environment Overview and select the cluster you want to monitor.
2. In the left-hand navigation, click Cluster overview > Cluster settings.
3. Under Identification, copy the Cluster ID beginning with lkc.

Connector ID

To find your Confluent Cloud Connector ID:

1. In Confluent Cloud, navigate to Environment Overview and select the cluster you want to monitor.
2. In the left-hand navigation, click Data integration > Connectors.
3. Under Connectors, copy the Connector ID beginning with lcc.
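
Because Cluster IDs begin with `lkc` and Connector IDs begin with `lcc`, a quick prefix check can catch a value pasted into the wrong field. The sketch below is illustrative only: the `classify_resource_id` helper and the example IDs are hypothetical, and it checks nothing beyond the prefixes described above.

```python
def classify_resource_id(resource_id: str) -> str:
    """Classify a Confluent Cloud resource ID by its documented prefix."""
    value = resource_id.strip()
    if value.startswith("lkc"):
        return "cluster"
    if value.startswith("lcc"):
        return "connector"
    raise ValueError(f"Unrecognized resource ID prefix: {value!r}")


# Made-up example IDs, for illustration only.
for example in ("lkc-abc123", "lcc-xyz789"):
    print(example, "->", classify_resource_id(example))
```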

Dashboards

After configuring the integration, see the out-of-the-box Confluent Cloud dashboard for an overview of Kafka cluster and connector metrics.

By default, all metrics collected across Confluent Cloud are displayed.

Data Collected

Metrics


confluent_cloud.kafka.received_bytes (count)	The delta count of bytes received from the network. Each sample is the number of bytes received since the previous data sample. The count is sampled every 60 seconds. Shown as byte
confluent_cloud.kafka.sent_bytes (count)	The delta count of bytes sent over the network. Each sample is the number of bytes sent since the previous data point. The count is sampled every 60 seconds. Shown as byte
confluent_cloud.kafka.received_records (count)	The delta count of records received. Each sample is the number of records received since the previous data sample. The count is sampled every 60 seconds. Shown as record
confluent_cloud.kafka.sent_records (count)	The delta count of records sent. Each sample is the number of records sent since the previous data point. The count is sampled every 60 seconds. Shown as record
confluent_cloud.kafka.retained_bytes (gauge)	The current count of bytes retained by the cluster. The count is sampled every 60 seconds. Shown as byte
confluent_cloud.kafka.active_connection_count (gauge)	The count of active authenticated connections. Shown as connection
confluent_cloud.kafka.connection_info (gauge)	Client connection metadata. Shown as connection
confluent_cloud.kafka.request_count (count)	The delta count of requests received over the network. Each sample is the number of requests received since the previous data point. The count is sampled every 60 seconds. Shown as request
confluent_cloud.kafka.partition_count (gauge)	The number of partitions.
confluent_cloud.kafka.successful_authentication_count (count)	The delta count of successful authentications. Each sample is the number of successful authentications since the previous data point. The count is sampled every 60 seconds. Shown as attempt
confluent_cloud.kafka.cluster_link_destination_response_bytes (count)	The delta count of cluster linking response bytes from all request types. Each sample is the number of bytes sent since the previous data point. The count is sampled every 60 seconds. Shown as byte
confluent_cloud.kafka.cluster_link_source_response_bytes (count)	The delta count of cluster linking source response bytes from all request types. Each sample is the number of bytes sent since the previous data point. The count is sampled every 60 seconds. Shown as byte
confluent_cloud.kafka.cluster_active_link_count (gauge)	The current count of active cluster links. The count is sampled every 60 seconds. The implicit time aggregation for this metric is MAX.
confluent_cloud.kafka.cluster_load_percent (gauge)	A measure of the utilization of the cluster. The value is between 0.0 and 1.0. Shown as percent
confluent_cloud.kafka.cluster_load_percent_max (gauge)	A measure of the maximum broker utilization across the cluster. The value is between 0.0 and 1.0. Shown as percent
confluent_cloud.kafka.cluster_load_percent_avg (gauge)	A measure of the average utilization across the cluster. The value is between 0.0 and 1.0. Shown as percent
confluent_cloud.kafka.consumer_lag_offsets (gauge)	The lag between a group member’s committed offset and the partition’s high watermark. Tagged with `consumer_group_id` and topic.
confluent_cloud.kafka.cluster_link_count (gauge)	The current count of cluster links. The count is sampled every 60 seconds. The implicit time aggregation for this metric is MAX.
confluent_cloud.kafka.cluster_link_task_count (gauge)	The current count of cluster link tasks. The count is sampled every 60 seconds. The implicit time aggregation for this metric is MAX.
confluent_cloud.kafka.cluster_link_mirror_transition_in_error (gauge)	The cluster linking mirror topic state transition error count for a link. The count is sampled every 60 seconds.
confluent_cloud.kafka.cluster_link_mirror_topic_bytes (count)	The delta count of cluster linking mirror topic bytes. The count is sampled every 60 seconds.
confluent_cloud.kafka.cluster_link_mirror_topic_count (gauge)	The cluster linking mirror topic count for a link. The count is sampled every 60 seconds.
confluent_cloud.kafka.cluster_link_mirror_topic_offset_lag (gauge)	The cluster linking mirror topic offset lag maximum across all partitions. The lag is sampled every 60 seconds.
confluent_cloud.kafka.request_bytes (gauge)	The delta count of total request bytes from the specified request types sent over the network. Each sample is the number of bytes sent since the previous data point. The count is sampled every 60 seconds.
confluent_cloud.kafka.response_bytes (gauge)	The delta count of total response bytes from the specified response types sent over the network. Each sample is the number of bytes sent since the previous data point. The count is sampled every 60 seconds.
confluent_cloud.kafka.rest_produce_request_bytes (count)	The delta count of total request bytes from Kafka REST produce calls sent over the network.
confluent_cloud.kafka.dedicated_cku_count (count)	The CKU count of a dedicated cluster.
confluent_cloud.kafka.producer_latency_avg_milliseconds (gauge)	The average latency of client producer requests. Shown as millisecond
confluent_cloud.connect.sent_records (count)	The delta count of total number of records sent from the transformations and written to Kafka for the source connector. Each sample is the number of records sent since the previous data point. The count is sampled every 60 seconds. Shown as record
confluent_cloud.connect.received_records (count)	The delta count of total number of records received by the sink connector. Each sample is the number of records received since the previous data point. The count is sampled every 60 seconds. Shown as record
confluent_cloud.connect.sent_bytes (count)	The delta count of total bytes sent from the transformations and written to Kafka for the source connector. Each sample is the number of bytes sent since the previous data point. The count is sampled every 60 seconds. Shown as byte
confluent_cloud.connect.received_bytes (count)	The delta count of total bytes received by the sink connector. Each sample is the number of bytes received since the previous data point. The count is sampled every 60 seconds. Shown as byte
confluent_cloud.connect.dead_letter_queue_records (count)	The delta count of dead letter queue records written to Kafka for the sink connector. The count is sampled every 60 seconds. Shown as record
confluent_cloud.connect.connector_status (count)	This metric monitors the status of a connector within the system. Its value is always set to 1 which signifies connector presence. The current operational state of the connector is identified through the status tag. Shown as record
confluent_cloud.connect.sql_server_cdc_source_connector_snapshot_running (gauge)	It represents whether the Snapshot is running. The values will incorporate any differences between the clocks on the machines where the database server and the connector are running.
confluent_cloud.connect.sql_server_cdc_source_connector_snapshot_completed (gauge)	It represents whether the Snapshot is completed. The values will incorporate any differences between the clocks on the machines where the database server and the connector are running.
confluent_cloud.connect.sql_server_cdc_source_connector_schema_history_status (gauge)	It represents the status of the schema history of the connector. The values will incorporate any differences between the clocks on the machines where the database server and the connector are running.
confluent_cloud.connect.mysql_cdc_source_connector_snapshot_running (gauge)	It represents whether the Snapshot is running. The values will incorporate any differences between the clocks on the machines where the database server and the connector are running.
confluent_cloud.connect.mysql_cdc_source_connector_snapshot_completed (gauge)	It represents whether the Snapshot is completed. The values will incorporate any differences between the clocks on the machines where the database server and the connector are running.
confluent_cloud.connect.mysql_cdc_source_connector_schema_history_status (gauge)	It represents the status of the schema history of the connector. The values will incorporate any differences between the clocks on the machines where the database server and the connector are running.
confluent_cloud.connect.postgres_cdc_source_connector_snapshot_running (gauge)	It represents whether the Snapshot is running. The values will incorporate any differences between the clocks on the machines where the database server and the connector are running.
confluent_cloud.connect.postgres_cdc_source_connector_snapshot_completed (gauge)	It represents whether the Snapshot is completed. The values will incorporate any differences between the clocks on the machines where the database server and the connector are running.
confluent_cloud.connect.connector_task_status (gauge)	Monitors the status of a connector’s task within the system. Its value is always set to 1 signifying the connector task’s presence.
confluent_cloud.connect.connector_task_batch_size_avg (gauge)	Monitors the average batch size (measured by record count) per minute. For a source connector it indicates the average batch size sent to Kafka. Shown as percent
confluent_cloud.connect.connector_task_batch_size_max (gauge)	Monitors the maximum batch size (measured by record count) per minute. For a source connector it indicates the max batch size sent to Kafka. Shown as percent
confluent_cloud.ksql.streaming_unit_count (gauge)	The count of Confluent Streaming Units (CSUs) for this KSQL instance. The count is sampled every 60 seconds. The implicit time aggregation for this metric is MAX. Shown as unit
confluent_cloud.ksql.query_saturation (gauge)	The maximum saturation for a given ksqlDB query across all nodes. Returns a value between 0 and 1. A value close to 1 indicates that ksqlDB query processing is bottlenecked on available resources.
confluent_cloud.ksql.task_stored_bytes (gauge)	The size of a given task’s state stores in bytes. Shown as byte
confluent_cloud.ksql.storage_utilization (gauge)	The total storage utilization for a given ksqlDB application.
confluent_cloud.schema_registry.schema_count (gauge)	The number of registered schemas.
confluent_cloud.schema_registry.request_count (count)	The delta count of requests received by the schema registry server. Each sample is the number of requests received since the previous data point. The count is sampled every 60 seconds.
confluent_cloud.kafka.deprecated_request_count (count)	The delta count of deprecated requests received over the network. Each sample is the number of requests received since the previous data point. The count is sampled every 60 seconds. Shown as request
confluent_cloud.schema_registry.schema_operations_count (count)	The delta count of schema-related operations. Each sample is the number of requests received since the previous data point. The count is sampled every 60 seconds.
confluent_cloud.flink.num_records_in (count)	Total number of records all Flink SQL statements leveraging a Flink compute pool have received.
confluent_cloud.flink.num_records_out (count)	Total number of records all Flink SQL statements leveraging a Flink compute pool have emitted.
confluent_cloud.flink.pending_records (gauge)	Total backlog of all Flink SQL statements leveraging a Flink compute pool.
confluent_cloud.flink.compute_pool_utilization.current_cfus (gauge)	The absolute number of CFUs at a given moment.
confluent_cloud.flink.compute_pool_utilization.cfu_minutes_consumed (count)	The number of CFUs consumed since the last measurement.
confluent_cloud.flink.compute_pool_utilization.cfu_limit (gauge)	The possible max number of CFUs for the pool.
confluent_cloud.flink.current_input_watermark_ms (gauge)	The last watermark this statement has received (in milliseconds) for the given table.
confluent_cloud.flink.current_output_watermark_ms (gauge)	The last watermark this statement has produced (in milliseconds) to the given table.
confluent_cloud.custom.kafka.consumer_lag_offsets (gauge)	The lag between a group member’s committed offset and the partition’s high watermark. Tagged with `consumer_group_id`, topic, partition, and `client_id`.
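
Many of the count metrics above are deltas sampled every 60 seconds, and the cluster load gauges are described as values between 0.0 and 1.0. The sketch below shows one way to turn such samples into a per-second rate, a percentage, and a consumed-to-produced record ratio. The helper names and sample numbers are made up for illustration. Treating `sent_records` as consumed and `received_records` as produced is an assumption about how such a ratio might be defined, not a statement of how the dashboard computes it.

```python
SAMPLE_INTERVAL_SECONDS = 60  # sampling interval stated in the metric descriptions


def per_second_rate(delta_count: float) -> float:
    """Convert one 60-second delta sample into an average per-second rate."""
    return delta_count / SAMPLE_INTERVAL_SECONDS


def load_as_percentage(load_fraction: float) -> float:
    """Convert a 0.0-1.0 load value into a 0-100 percentage."""
    if not 0.0 <= load_fraction <= 1.0:
        raise ValueError("load value is expected to be between 0.0 and 1.0")
    return load_fraction * 100.0


def consumed_to_produced_ratio(sent_records: float, received_records: float) -> float:
    """Ratio of records sent by the cluster to records it received."""
    if received_records == 0:
        return 0.0  # avoid division by zero when nothing was produced
    return sent_records / received_records


# Made-up sample values, for illustration only.
print(per_second_rate(6_000_000))            # one received_bytes sample
print(load_as_percentage(0.25))              # one cluster_load_percent sample
print(consumed_to_produced_ratio(300, 100))  # sent_records vs. received_records
```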

Events

The Confluent Cloud integration does not include any events.

Service Checks

The Confluent Cloud integration does not include any service checks.

Troubleshooting

Need help? Contact Datadog support.