Google Kubernetes Engine, Cloud

Overview

Google Kubernetes Engine (GKE) is a powerful cluster manager and orchestration system for running your Docker containers.

Get metrics from Google Kubernetes Engine to:

  • Visualize the performance of your GKE containers and GKE control plane.
  • Correlate the performance of your GKE containers with your applications.

This integration comes with two separate preset dashboards:

  • The standard GKE dashboard presents the GKE and GKE control plane metrics collected from the Google integration.
  • The enhanced GKE dashboard presents metrics from Datadog’s Agent-based Kubernetes integration alongside GKE control plane metrics collected from the Google integration.

The standard dashboard provides observability into GKE with a simple configuration. The enhanced dashboard requires additional configuration steps, but provides more real-time Kubernetes metrics, and is often a better starting point when cloning and customizing a dashboard for monitoring workloads in production.

Unlike self-hosted Kubernetes clusters, the GKE control plane is managed by Google and not accessible by a Datadog Agent running in the cluster. Therefore, observability into the GKE control plane requires the Google integration even if you are primarily using the Datadog Agent to monitor your clusters.

Setup

Metric collection

Installation

  1. If you haven’t already, set up the Google Cloud Platform integration first. There are no other installation steps for the standard metrics and preset dashboard.

  2. To populate the enhanced dashboard and enable APM tracing, logging, profiling, security, and other Datadog services, install the Datadog Agent into your GKE cluster.

  3. To populate the control plane metrics, you must enable GKE control plane metrics. Control plane metrics give you visibility into the operation of the Kubernetes control plane, which is managed by Google in GKE.
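
As a rough illustration of step 3, the sketch below assembles a gcloud command that turns on control plane metrics for an existing cluster. It only builds the argument list and does not run anything. The helper name, the cluster name, and the location are hypothetical, and the `--monitoring` component names reflect the GKE documentation as best understood here, so verify them against the current gcloud reference before use.

```python
from __future__ import annotations


def control_plane_metrics_command(cluster: str, location: str) -> list[str]:
    """Build (but do not run) a gcloud command enabling control plane metrics.

    Assumption: the cluster accepts the SYSTEM, API_SERVER, SCHEDULER and
    CONTROLLER_MANAGER monitoring components; check the GKE docs for your version.
    """
    components = ["SYSTEM", "API_SERVER", "SCHEDULER", "CONTROLLER_MANAGER"]
    return [
        "gcloud", "container", "clusters", "update", cluster,
        f"--location={location}",
        f"--monitoring={','.join(components)}",
    ]


if __name__ == "__main__":
    # "my-cluster" and "us-central1" are placeholder values.
    print(" ".join(control_plane_metrics_command("my-cluster", "us-central1")))
```

Printing the command rather than executing it lets you review it before applying the change to a cluster.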

Log collection

Google Kubernetes Engine logs are collected with Google Cloud Logging and sent to a Dataflow job through a Cloud Pub/Sub topic. If you haven’t already, set up logging with the Datadog Dataflow template.

Once this is done, export your Google Kubernetes Engine logs from Google Cloud Logging to the Pub/Sub topic:

  1. Go to the GCP Logs Explorer page and filter for Kubernetes and GKE logs.

  2. Click Create Sink and name the sink accordingly.

  3. Choose “Cloud Pub/Sub” as the destination and select the Pub/Sub topic that was created for that purpose. Note: The Pub/Sub topic can be located in a different project.

  4. Click Create and wait for the confirmation message to show up.
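
The two values the steps above ask for, a log filter and a Pub/Sub destination, can be sketched as plain strings. This is a minimal sketch under assumptions: the function names are hypothetical, the list of monitored resource types is one plausible choice for "Kubernetes and GKE logs" rather than the integration's required filter, and the project and topic IDs are placeholders.

```python
# Monitored resource types commonly attached to GKE log entries.
# Assumption: adjust this tuple to the logs you actually want to export.
GKE_RESOURCE_TYPES = (
    "k8s_container",
    "k8s_pod",
    "k8s_node",
    "k8s_cluster",
    "gke_cluster",
)


def gke_log_filter(resource_types=GKE_RESOURCE_TYPES) -> str:
    """Return a Cloud Logging filter matching any of the given resource types."""
    quoted = " OR ".join(f'"{t}"' for t in resource_types)
    return f"resource.type=({quoted})"


def pubsub_destination(project_id: str, topic_id: str) -> str:
    """Return the sink destination string for a Pub/Sub topic."""
    return f"pubsub.googleapis.com/projects/{project_id}/topics/{topic_id}"


if __name__ == "__main__":
    # "my-project" and "export-logs-to-datadog" are placeholder names.
    print(gke_log_filter())
    print(pubsub_destination("my-project", "export-logs-to-datadog"))
```

Because the destination string carries its own project ID, it can point at a topic in a different project from the one that owns the sink. In that case the sink's writer identity still needs permission to publish to the topic.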

Data Collected

Metrics

gcp.gke.container.accelerator.duty_cycle
(gauge)
Percent of time over the past sample period during which the accelerator was actively processing.
Shown as percent
gcp.gke.container.accelerator.memory_total
(gauge)
Total accelerator memory.
Shown as byte
gcp.gke.container.accelerator.memory_used
(gauge)
Total accelerator memory allocated.
Shown as byte
gcp.gke.container.accelerator.request
(gauge)
Number of accelerator devices requested by the container.
Shown as device
gcp.gke.container.cpu.core_usage_time
(count)
Cumulative CPU usage on all cores used by the container.
Shown as second
gcp.gke.container.cpu.limit_cores
(gauge)
CPU cores limit of the container.
Shown as core
gcp.gke.container.cpu.limit_utilization
(gauge)
Fraction of the CPU limit that is currently in use on the instance.
Shown as fraction
gcp.gke.container.cpu.request_cores
(gauge)
Number of CPU cores requested by the container.
Shown as core
gcp.gke.container.cpu.request_utilization
(gauge)
Fraction of the requested CPU that is currently in use on the instance.
Shown as fraction
gcp.gke.container.ephemeral_storage.limit_bytes
(gauge)
Local ephemeral storage limit.
Shown as byte
gcp.gke.container.ephemeral_storage.request_bytes
(gauge)
Local ephemeral storage request.
Shown as byte
gcp.gke.container.ephemeral_storage.used_bytes
(gauge)
Local ephemeral storage usage.
Shown as byte
gcp.gke.container.memory.limit_bytes
(gauge)
Memory limit of the container.
Shown as byte
gcp.gke.container.memory.limit_utilization
(gauge)
Fraction of the memory limit that is currently in use on the instance.
Shown as fraction
gcp.gke.container.memory.page_fault_count
(count)
Number of page faults, broken down by type.
Shown as fault
gcp.gke.container.memory.request_bytes
(gauge)
Memory request of the container.
Shown as byte
gcp.gke.container.memory.request_utilization
(gauge)
Fraction of the requested memory that is currently in use on the instance.
Shown as fraction
gcp.gke.container.memory.used_bytes
(gauge)
Memory usage of the container.
Shown as byte
gcp.gke.container.restart_count
(count)
Number of times the container has restarted.
Shown as occurrence
gcp.gke.container.uptime
(gauge)
Time in seconds that the container has been running.
Shown as second
gcp.gke.node.cpu.allocatable_cores
(gauge)
Number of allocatable CPU cores on the node.
Shown as core
gcp.gke.node.cpu.allocatable_utilization
(gauge)
Fraction of the allocatable CPU that is currently in use on the instance.
Shown as fraction
gcp.gke.node.cpu.core_usage_time
(count)
Cumulative CPU usage on all cores used on the node.
Shown as second
gcp.gke.node.cpu.total_cores
(gauge)
Total number of CPU cores on the node.
Shown as core
gcp.gke.node.ephemeral_storage.allocatable_bytes
(gauge)
Local ephemeral storage bytes allocatable on the node.
Shown as byte
gcp.gke.node.ephemeral_storage.inodes_free
(gauge)
Free number of inodes on local ephemeral storage.
gcp.gke.node.ephemeral_storage.inodes_total
(gauge)
Total number of inodes on local ephemeral storage.
gcp.gke.node.ephemeral_storage.total_bytes
(gauge)
Total ephemeral storage bytes on the node.
Shown as byte
gcp.gke.node.ephemeral_storage.used_bytes
(gauge)
Local ephemeral storage bytes used by the node.
Shown as byte
gcp.gke.node.memory.allocatable_bytes
(gauge)
Memory bytes allocatable on the node.
Shown as byte
gcp.gke.node.memory.allocatable_utilization
(gauge)
Fraction of the allocatable memory that is currently in use on the instance.
Shown as fraction
gcp.gke.node.memory.total_bytes
(gauge)
Total number of bytes of memory on the node.
Shown as byte
gcp.gke.node.memory.used_bytes
(gauge)
Cumulative memory bytes used by the node.
Shown as byte
gcp.gke.node.network.received_bytes_count
(count)
Cumulative number of bytes received by the node over the network.
Shown as byte
gcp.gke.node.network.sent_bytes_count
(count)
Cumulative number of bytes transmitted by the node over the network.
Shown as byte
gcp.gke.node.pid_limit
(gauge)
Max PID of OS on the node.
gcp.gke.node.pid_used
(gauge)
Number of running processes in the OS on the node.
gcp.gke.node_daemon.cpu.core_usage_time
(count)
Cumulative CPU usage on all cores used by the node level system daemon.
Shown as second
gcp.gke.node_daemon.memory.used_bytes
(gauge)
Memory usage by the system daemon.
Shown as byte
gcp.gke.pod.network.received_bytes_count
(count)
Cumulative number of bytes received by the pod over the network.
Shown as byte
gcp.gke.pod.network.sent_bytes_count
(count)
Cumulative number of bytes transmitted by the pod over the network.
Shown as byte
gcp.gke.pod.volume.total_bytes
(gauge)
Total number of disk bytes available to the pod.
Shown as byte
gcp.gke.pod.volume.used_bytes
(gauge)
Number of disk bytes used by the pod.
Shown as byte
gcp.gke.pod.volume.utilization
(gauge)
Fraction of the volume that is currently being used by the instance.
Shown as fraction
gcp.gke.control_plane.apiserver.admission_controller_admission_duration_seconds
(gauge)
Admission controller latency histogram in seconds, identified by name and broken out for each operation and API resource and type (validate or admit).
Shown as second
gcp.gke.control_plane.apiserver.admission_step_admission_duration_seconds
(gauge)
Admission sub-step latency histogram in seconds, broken out for each operation and API resource and step type (validate or admit).
Shown as second
gcp.gke.control_plane.apiserver.admission_webhook_admission_duration_seconds
(gauge)
Admission webhook latency histogram in seconds, identified by name and broken out for each operation and API resource and type (validate or admit).
Shown as second
gcp.gke.control_plane.apiserver.current_inflight_requests
(gauge)
Maximum number of in-flight requests currently in use by this apiserver, per request kind.
Shown as request
gcp.gke.control_plane.apiserver.request_duration_seconds
(gauge)
Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component.
Shown as second
gcp.gke.control_plane.apiserver.request_total
(gauge)
Counter of apiserver requests broken out for each verb, dry run value, group, version, resource, scope, component, and HTTP response code.
Shown as request
gcp.gke.control_plane.apiserver.response_sizes
(gauge)
Response size distribution in bytes for each group, version, verb, resource, subresource, scope and component.
Shown as byte
gcp.gke.control_plane.apiserver.storage_objects
(gauge)
Number of stored objects at the time of last check split by kind.
Shown as object
gcp.gke.control_plane.controller_manager.node_collector_evictions_number
(count)
Number of Node evictions that happened since current instance of NodeController started.
Shown as event
gcp.gke.control_plane.scheduler.pending_pods
(gauge)
Number of pending pods, by the queue type.
Shown as event
gcp.gke.control_plane.scheduler.pod_scheduling_duration_seconds
(gauge)
End-to-end latency for a pod being scheduled.
Shown as second
gcp.gke.control_plane.scheduler.preemption_attempts_total
(count)
Total preemption attempts in the cluster so far.
Shown as attempt
gcp.gke.control_plane.scheduler.preemption_victims
(gauge)
Number of selected preemption victims
Shown as event
gcp.gke.control_plane.scheduler.scheduling_attempt_duration_seconds
(gauge)
Scheduling attempt latency in seconds
Shown as second
gcp.gke.control_plane.scheduler.schedule_attempts_total
(gauge)
Number of attempts to schedule pods.
Shown as attempt
gcp.gke.control_plane.apiserver.aggregator_unavailable_apiservice
(gauge)
(Deprecated)
gcp.gke.control_plane.apiserver.audit_event_total
(gauge)
(Deprecated) Accumulated number of audit events generated and sent to the audit backend.
Shown as event
gcp.gke.control_plane.apiserver.audit_level_total
(gauge)
(Deprecated)
gcp.gke.control_plane.apiserver.audit_requests_rejected_total
(gauge)
(Deprecated)
Shown as request
gcp.gke.control_plane.apiserver.client_certificate_expiration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.apiserver.etcd_object_counts
(gauge)
(Deprecated) Number of stored objects split by kind.
Shown as object
gcp.gke.control_plane.apiserver.etcd_request_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.apiserver.init_events_total
(gauge)
(Deprecated)
Shown as event
gcp.gke.control_plane.apiserver.longrunning_gauge
(gauge)
(Deprecated) Gauge of all active long-running apiserver requests.
Shown as request
gcp.gke.control_plane.apiserver.registered_watchers
(gauge)
(Deprecated) Number of currently registered watchers for a given resource.
Shown as object
gcp.gke.control_plane.apiserver.workqueue_adds_total
(count)
(Deprecated)
gcp.gke.control_plane.apiserver.workqueue_depth
(gauge)
(Deprecated)
gcp.gke.control_plane.apiserver.workqueue_longest_running_processor_seconds
(gauge)
(Deprecated) Number of seconds that the longest running processor has been running.
Shown as second
gcp.gke.control_plane.apiserver.workqueue_queue_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.apiserver.workqueue_retries_total
(count)
(Deprecated)
gcp.gke.control_plane.apiserver.workqueue_unfinished_work_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.apiserver.workqueue_work_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.controller_manager.cloudprovider_gce_api_request_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.controller_manager.cronjob_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by cronjob controller
gcp.gke.control_plane.controller_manager.daemon_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by daemon controller
gcp.gke.control_plane.controller_manager.deployment_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by deployment controller
gcp.gke.control_plane.controller_manager.endpoint_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by endpoint controller
gcp.gke.control_plane.controller_manager.gc_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by GC controller
gcp.gke.control_plane.controller_manager.job_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by job controller
gcp.gke.control_plane.controller_manager.leader_election_master_status
(gauge)
(Deprecated)
gcp.gke.control_plane.controller_manager.namespace_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by namespace controller
gcp.gke.control_plane.controller_manager.node_collector_evictions_number
(count)
(Deprecated) Count of node eviction events.
gcp.gke.control_plane.controller_manager.node_collector_unhealthy_nodes_in_zone
(gauge)
(Deprecated) Number of unhealthy nodes
gcp.gke.control_plane.controller_manager.node_collector_zone_health
(gauge)
(Deprecated)
gcp.gke.control_plane.controller_manager.node_collector_zone_size
(gauge)
(Deprecated)
gcp.gke.control_plane.controller_manager.node_ipam_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by IPAM controller
gcp.gke.control_plane.controller_manager.node_lifecycle_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by lifecycle controller
gcp.gke.control_plane.controller_manager.persistentvolume_protection_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by persistent volume protection controller
gcp.gke.control_plane.controller_manager.persistentvolumeclaim_protection_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by persistent volume claim protection controller
gcp.gke.control_plane.controller_manager.replicaset_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by ReplicaSet controller
gcp.gke.control_plane.controller_manager.replication_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by replication controller
gcp.gke.control_plane.controller_manager.route_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by route controller
gcp.gke.control_plane.controller_manager.service_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by service controller
gcp.gke.control_plane.controller_manager.serviceaccount_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by service account controller
gcp.gke.control_plane.controller_manager.serviceaccount_tokens_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by service account tokens controller
gcp.gke.control_plane.controller_manager.workqueue_adds_total
(count)
(Deprecated)
gcp.gke.control_plane.controller_manager.workqueue_depth
(gauge)
(Deprecated)
gcp.gke.control_plane.controller_manager.workqueue_longest_running_processor_seconds
(gauge)
(Deprecated) Number of seconds that the longest running processor has been running.
Shown as second
gcp.gke.control_plane.controller_manager.workqueue_queue_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.controller_manager.workqueue_retries_total
(count)
(Deprecated)
gcp.gke.control_plane.controller_manager.workqueue_unfinished_work_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.controller_manager.workqueue_work_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.scheduler.binding_duration_seconds
(gauge)
(Deprecated) Binding latency in seconds.
Shown as second
gcp.gke.control_plane.scheduler.e2e_scheduling_duration_seconds
(gauge)
(Deprecated) Total e2e scheduling latency.
Shown as second
gcp.gke.control_plane.scheduler.framework_extension_point_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.scheduler.leader_election_master_status
(gauge)
(Deprecated)
gcp.gke.control_plane.scheduler.scheduling_algorithm_duration_seconds
(gauge)
(Deprecated) Total scheduling algorithm latency.
Shown as second
gcp.gke.control_plane.scheduler.scheduling_algorithm_preemption_evaluation_seconds
(gauge)
(Deprecated)
Shown as second
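
Two kinds of values recur in the list above: cumulative counters measured in seconds (such as the `core_usage_time` metrics) and utilization fractions. The sketch below shows one way to read them, assuming you are working with raw cumulative samples. Values you query may already be per-interval deltas depending on how the metric is ingested, so treat this as an illustration of the arithmetic, not a description of how the integration computes anything. Both function names are hypothetical.

```python
from typing import Optional


def average_cores_used(prev_seconds: float, curr_seconds: float,
                       interval_seconds: float) -> float:
    """Average number of cores in use between two cumulative CPU-time samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_seconds - prev_seconds
    if delta < 0:
        # The counter went backwards, for example after a restart:
        # treat the current sample as the usage since the reset.
        delta = curr_seconds
    return delta / interval_seconds


def utilization(used: float, capacity: float) -> Optional[float]:
    """Fraction of capacity in use, or None when no capacity is set."""
    if capacity <= 0:
        return None
    return used / capacity


if __name__ == "__main__":
    # 30 CPU-seconds consumed over a 60-second window is half a core on average.
    print(average_cores_used(120.0, 150.0, 60.0))
    # 256 bytes used out of a 1024-byte limit is a fraction of 0.25.
    print(utilization(256, 1024))
```

Returning `None` for a missing limit reflects that a container without a configured limit has no meaningful limit utilization.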

Events

The Google Kubernetes Engine integration does not include any events.

Service Checks

The Google Kubernetes Engine integration does not include any service checks.

Troubleshooting

Need help? Contact Datadog support.