Kubernetes Cluster Autoscaler

Supported OS: Linux, Windows, macOS

Integration version: 2.2.0

Agent Check: Kubernetes Cluster Autoscaler

Overview

This check monitors Kubernetes Cluster Autoscaler through the Datadog Agent.

Setup

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.

Installation

The Kubernetes Cluster Autoscaler check is included in the Datadog Agent package (Agent >= 7.55.x). No additional installation is needed on your server.

Configuration

  1. Edit the kubernetes_cluster_autoscaler.d/conf.yaml file, in the conf.d/ folder at the root of your Agent’s configuration directory, to start collecting your Kubernetes Cluster Autoscaler performance data. See the sample kubernetes_cluster_autoscaler.d/conf.yaml for all available configuration options.

  2. Restart the Agent.
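
For a host-based Agent, a minimal kubernetes_cluster_autoscaler.d/conf.yaml might look like the sketch below. It sets only the openmetrics_endpoint parameter and assumes the Cluster Autoscaler metrics endpoint is reachable from the Agent at localhost on the default port 8085. That address is an illustrative assumption, so replace it with the location that applies to your environment.

```yaml
# kubernetes_cluster_autoscaler.d/conf.yaml (minimal sketch)
init_config:

instances:
    # Location of the Prometheus-formatted metrics.
    # "localhost" is an assumption; 8085 and /metrics are the defaults.
  - openmetrics_endpoint: http://localhost:8085/metrics
```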

Metric collection

Make sure that the Cluster Autoscaler in your cluster exposes its Prometheus-formatted metrics. For the Agent to start collecting metrics, the Cluster Autoscaler pods need to be annotated.

Kubernetes Cluster Autoscaler has metrics and livenessProbe endpoints that can be accessed on port 8085. These endpoints are located under /metrics and /health-check and provide valuable information about the state of your cluster during scaling operations.

Note: To change the default port, use the --address flag.
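
As a sketch of what changing the port could look like, the fragment below passes --address in the container command of a Cluster Autoscaler deployment and exposes the matching container port. The container name, the binary path, the port 9090, and the [host]:port value format are assumptions for illustration; check them against your own manifest and Cluster Autoscaler version.

```yaml
# Fragment of a Cluster Autoscaler Deployment pod template (illustrative only).
spec:
  containers:
    - name: cluster-autoscaler        # hypothetical container name
      command:
        - ./cluster-autoscaler        # assumed binary path; other flags omitted
        - --address=:9090             # assumed value format: [host]:port
      ports:
        - name: app
          containerPort: 9090         # keep in sync with --address
```

If the port changes, the port in openmetrics_endpoint has to change with it.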

To configure the Cluster Autoscaler to expose metrics, do the following:

  1. Enable access to the /metrics route and expose port 8085 for your Cluster Autoscaler deployment:

ports:
  - name: app
    containerPort: 8085

  2. Instruct Prometheus to scrape it by adding the following annotation to your Cluster Autoscaler service:

prometheus.io/scrape: "true"
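
To show where this annotation sits in a complete manifest, here is a minimal Service sketch. The Service name, namespace, and selector label are hypothetical and must match your own Cluster Autoscaler deployment. The annotation value is quoted because Kubernetes annotation values must be strings.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: cluster-autoscaler            # hypothetical Service name
  namespace: kube-system              # hypothetical namespace
  annotations:
    prometheus.io/scrape: "true"      # annotation values must be strings
spec:
  selector:
    app: cluster-autoscaler           # hypothetical label on the autoscaler pods
  ports:
    - name: app
      port: 8085
      targetPort: 8085                # the containerPort exposed by the deployment
```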

Note: The listed metrics can only be collected if they are available. Some metrics are generated only when certain actions are performed.

The only parameters required for configuring the kubernetes_cluster_autoscaler check are:

  • CONTAINER_NAME: Name of the container that runs the Cluster Autoscaler controller.
  • openmetrics_endpoint: Set this parameter to the location where the Prometheus-formatted metrics are exposed. The default port is 8085. To configure a different port, use the METRICS_PORT environment variable. In containerized environments, use %%host%% for host autodetection.

apiVersion: v1
kind: Pod
# (...)
metadata:
  name: '<POD_NAME>'
  annotations:
    ad.datadoghq.com/<CONTAINER_NAME>.checks: |
      {
        "kubernetes_cluster_autoscaler": {
          "init_config": {},
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:8085/metrics"
            }
          ]
        }
      }
    # (...)
spec:
  containers:
    - name: '<CONTAINER_NAME>'
# (...)

Validation

Run the Agent’s status subcommand and look for kubernetes_cluster_autoscaler under the Checks section.

Data Collected

Metrics

kubernetes_cluster_autoscaler.cluster.cpu.current.cores
(gauge)
Current CPU core usage in the cluster
kubernetes_cluster_autoscaler.cluster.memory.current.bytes
(gauge)
Current memory usage in bytes in the cluster
kubernetes_cluster_autoscaler.cluster.safe.to.autoscale
(gauge)
Indicates whether the cluster is safe to autoscale
kubernetes_cluster_autoscaler.cpu.limits.cores
(gauge)
Total CPU cores limits set for pods in the cluster
kubernetes_cluster_autoscaler.created.node.groups.count
(count)
Total count of node groups created in the cluster
kubernetes_cluster_autoscaler.deleted.node.groups.count
(count)
Total count of node groups deleted in the cluster
kubernetes_cluster_autoscaler.errors.count
(count)
Total count of errors that occurred in the cluster
kubernetes_cluster_autoscaler.evicted.pods.count
(count)
Total count of evicted pods in the cluster
kubernetes_cluster_autoscaler.failed.scale.ups.count
(count)
Total count of failed scale-up operations in the cluster
kubernetes_cluster_autoscaler.function.duration.seconds.bucket
(count)
Duration of a specific function in the cluster (bucket)
kubernetes_cluster_autoscaler.function.duration.seconds.count
(count)
Duration of a specific function in the cluster (count)
kubernetes_cluster_autoscaler.function.duration.seconds.sum
(count)
Duration of a specific function in the cluster (sum)
kubernetes_cluster_autoscaler.go.gc.duration.seconds.count
(count)
A summary of the pause duration of garbage collection cycles
Shown as second
kubernetes_cluster_autoscaler.go.gc.duration.seconds.quantile
(gauge)
A summary of the pause duration of garbage collection cycles
Shown as second
kubernetes_cluster_autoscaler.go.gc.duration.seconds.sum
(count)
A summary of the pause duration of garbage collection cycles
Shown as second
kubernetes_cluster_autoscaler.go.goroutines
(gauge)
Number of goroutines that currently exist
kubernetes_cluster_autoscaler.go.info
(gauge)
Information about the Go environment
kubernetes_cluster_autoscaler.go.memstats.alloc_bytes
(gauge)
Number of bytes allocated and still in use
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.alloc_bytes.count
(count)
Total number of bytes allocated even if freed
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.buck_hash.sys_bytes
(gauge)
Number of bytes used by the profiling bucket hash table
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.frees.count
(count)
Total number of frees
kubernetes_cluster_autoscaler.go.memstats.gc.sys_bytes
(gauge)
Number of bytes used for garbage collection system metadata
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.heap.alloc_bytes
(gauge)
Number of heap bytes allocated and still in use
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.heap.idle_bytes
(gauge)
Number of heap bytes waiting to be used
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.heap.inuse_bytes
(gauge)
Number of heap bytes that are in use
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.heap.objects
(gauge)
Number of allocated objects
Shown as object
kubernetes_cluster_autoscaler.go.memstats.heap.released_bytes
(gauge)
Number of heap bytes released to OS
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.heap.sys_bytes
(gauge)
Number of heap bytes obtained from system
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.lookups.count
(count)
Total number of pointer lookups
kubernetes_cluster_autoscaler.go.memstats.mallocs.count
(count)
Total number of mallocs
kubernetes_cluster_autoscaler.go.memstats.mcache.inuse_bytes
(gauge)
Number of bytes in use by mcache structures
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.mcache.sys_bytes
(gauge)
Number of bytes used for mcache structures obtained from system
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.mspan.inuse_bytes
(gauge)
Number of bytes in use by mspan structures
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.mspan.sys_bytes
(gauge)
Number of bytes used for mspan structures obtained from system
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.next.gc_bytes
(gauge)
Number of heap bytes when next garbage collection will take place
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.other.sys_bytes
(gauge)
Number of bytes used for other system allocations
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.stack.inuse_bytes
(gauge)
Number of bytes in use by the stack allocator
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.stack.sys_bytes
(gauge)
Number of bytes obtained from system for stack allocator
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.sys_bytes
(gauge)
Number of bytes obtained from system
Shown as byte
kubernetes_cluster_autoscaler.go.threads
(gauge)
Number of OS threads created
Shown as thread
kubernetes_cluster_autoscaler.last.activity
(gauge)
Timestamp of the last activity in the cluster
kubernetes_cluster_autoscaler.max.nodes.count
(gauge)
Maximum number of nodes allowed in the cluster
kubernetes_cluster_autoscaler.memory.limits.bytes
(gauge)
Total memory limits set for pods in the cluster
kubernetes_cluster_autoscaler.nap.enabled
(gauge)
Indicates whether Node Auto-Provisioning (NAP) is enabled in the cluster
kubernetes_cluster_autoscaler.node.groups.count
(gauge)
Number of node groups in the cluster
kubernetes_cluster_autoscaler.nodes.count
(gauge)
Number of nodes in the cluster
kubernetes_cluster_autoscaler.old.unregistered.nodes.removed.count
(count)
Total count of old unregistered nodes removed from the cluster
kubernetes_cluster_autoscaler.scaled.down.gpu.nodes.count
(count)
Total count of GPU nodes scaled down in the cluster
kubernetes_cluster_autoscaler.scaled.down.nodes.count
(count)
Total count of nodes scaled down in the cluster
kubernetes_cluster_autoscaler.scaled.up.gpu.nodes.count
(count)
Total count of GPU nodes scaled up in the cluster
kubernetes_cluster_autoscaler.scaled.up.nodes.count
(count)
Total count of nodes scaled up in the cluster
kubernetes_cluster_autoscaler.skipped.scale.events.count
(count)
Total count of skipped scale events in the cluster
kubernetes_cluster_autoscaler.unneeded.nodes.count
(gauge)
Total count of unneeded nodes in the cluster
kubernetes_cluster_autoscaler.unschedulable.pods.count
(gauge)
Number of unschedulable pods in the cluster

Events

The Kubernetes Cluster Autoscaler integration does not include any events.

Service Checks

kubernetes_cluster_autoscaler.openmetrics.health

Returns CRITICAL if the Agent is unable to connect to the Kubernetes Cluster Autoscaler OpenMetrics endpoint, otherwise returns OK.

Statuses: ok, critical

Troubleshooting

Need help? Contact Datadog support.