Kubernetes Cluster Autoscaler

Supported OS: Linux, Windows, macOS

Integration version: 1.0.1

Overview

This check monitors Kubernetes Cluster Autoscaler through the Datadog Agent.

Setup

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.

Installation

The Kubernetes Cluster Autoscaler check is included in the Datadog Agent package. No additional installation is needed on your server.

Configuration

  1. Edit the kubernetes_cluster_autoscaler.d/conf.yaml file, in the conf.d/ folder at the root of your Agent’s configuration directory, to start collecting your Kubernetes Cluster Autoscaler performance data. See the sample kubernetes_cluster_autoscaler.d/conf.yaml for all available configuration options.

  2. Restart the Agent.
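
As a rough illustration of step 1, a minimal conf.yaml for a host-based Agent might look like the following. This is a sketch, not the shipped sample file: the localhost address is an assumption and should be replaced with wherever your Cluster Autoscaler exposes its metrics. The port 8085 and the /metrics path are the defaults described in this document.

```yaml
# kubernetes_cluster_autoscaler.d/conf.yaml (minimal sketch)
init_config:

instances:
  # Assumption: the Cluster Autoscaler metrics endpoint is reachable from
  # the Agent host at localhost on the default port 8085.
  - openmetrics_endpoint: http://localhost:8085/metrics
```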

Metric collection

Make sure that the Prometheus-formatted metrics are exposed by the Cluster Autoscaler in your cluster. For the Agent to start collecting metrics, the Cluster Autoscaler pods need to be annotated.

Kubernetes Cluster Autoscaler has metrics and livenessProbe endpoints that can be accessed on port 8085. These endpoints are located under /metrics and /health-check and provide valuable information about the state of your cluster during scaling operations.

Note: To change the default port, use the --address flag.
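
For instance, the flag could be set in the container command of the Cluster Autoscaler Deployment. The fragment below is only a sketch: the container name, the image placeholder, the binary path, and port 9090 are illustrative assumptions. The value uses the host:port form, where an empty host means all interfaces.

```yaml
# Fragment of a Cluster Autoscaler Deployment (illustrative values)
spec:
  template:
    spec:
      containers:
        - name: cluster-autoscaler          # assumed container name
          image: '<CLUSTER_AUTOSCALER_IMAGE>'
          command:
            - ./cluster-autoscaler          # assumed binary path
            - --address=:9090               # listen on 9090 instead of the default 8085
          ports:
            - name: app
              containerPort: 9090
```

If the port is changed this way, the port in openmetrics_endpoint needs to be updated to match.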

To configure the Cluster Autoscaler to expose metrics, do the following:

  1. Enable access to the /metrics route and expose port 8085 for your Cluster Autoscaler deployment:

ports:
  - name: app
    containerPort: 8085

  2. Instruct Prometheus to scrape it by adding the following annotation to your Cluster Autoscaler service:

prometheus.io/scrape: "true"
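
To show where that annotation lives, here is a hedged sketch of a Service in front of the Cluster Autoscaler. The Service name, namespace, and selector label are hypothetical and should be adjusted to your deployment. Kubernetes annotation values must be strings, so the value is quoted.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: cluster-autoscaler            # hypothetical name
  namespace: kube-system              # assumed namespace
  annotations:
    prometheus.io/scrape: "true"      # annotation values must be strings
spec:
  selector:
    app: cluster-autoscaler           # hypothetical pod label
  ports:
    - name: app
      port: 8085
      targetPort: 8085
```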

Note: The listed metrics can only be collected if they are available. Some metrics are generated only when certain actions are performed.

The only parameter required for configuring the kubernetes_cluster_autoscaler check is openmetrics_endpoint. This parameter should be set to the location where the Prometheus-formatted metrics are exposed. The default port is 8085. To configure a different port, use the METRICS_PORT environment variable. In containerized environments, %%host%% should be used for host autodetection.

apiVersion: v1
kind: Pod
# (...)
metadata:
  name: '<POD_NAME>'
  annotations:
    ad.datadoghq.com/controller.checks: |
      {
        "kubernetes_cluster_autoscaler": {
          "init_config": {},
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:8085/metrics"
            }
          ]
        }
      }
    # (...)
spec:
  containers:
    - name: 'controller'
      # (...)

Validation

Run the Agent’s status subcommand and look for kubernetes_cluster_autoscaler under the Checks section.

Data Collected

Metrics

kubernetes_cluster_autoscaler.cluster.cpu.current.cores
(gauge)
Current CPU cores usage in the cluster
kubernetes_cluster_autoscaler.cluster.memory.current.bytes
(gauge)
Current memory usage in bytes in the cluster
kubernetes_cluster_autoscaler.cluster.safe.to.autoscale
(gauge)
Indicates whether the cluster is safe to autoscale
kubernetes_cluster_autoscaler.cpu.limits.cores
(gauge)
Total CPU cores limits set for pods in the cluster
kubernetes_cluster_autoscaler.created.node.groups.count
(count)
Total count of node groups created in the cluster
kubernetes_cluster_autoscaler.deleted.node.groups.count
(count)
Total count of node groups deleted in the cluster
kubernetes_cluster_autoscaler.errors.count
(count)
Total count of errors that occurred in the cluster
kubernetes_cluster_autoscaler.evicted.pods.count
(count)
Total count of evicted pods in the cluster
kubernetes_cluster_autoscaler.failed.scale.ups.count
(count)
Total count of failed scale-up operations in the cluster
kubernetes_cluster_autoscaler.function.duration.seconds.bucket
(count)
Duration of a specific function in the cluster (bucket)
kubernetes_cluster_autoscaler.function.duration.seconds.count
(count)
Duration of a specific function in the cluster (count)
kubernetes_cluster_autoscaler.function.duration.seconds.sum
(count)
Duration of a specific function in the cluster (sum)
kubernetes_cluster_autoscaler.go.gc.duration.seconds.count
(count)
A summary of the pause duration of garbage collection cycles
Shown as second
kubernetes_cluster_autoscaler.go.gc.duration.seconds.quantile
(gauge)
A summary of the pause duration of garbage collection cycles
Shown as second
kubernetes_cluster_autoscaler.go.gc.duration.seconds.sum
(count)
A summary of the pause duration of garbage collection cycles
Shown as second
kubernetes_cluster_autoscaler.go.goroutines
(gauge)
Number of goroutines that currently exist
kubernetes_cluster_autoscaler.go.info
(gauge)
Information about the Go environment
kubernetes_cluster_autoscaler.go.memstats.alloc_bytes
(gauge)
Number of bytes allocated and still in use
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.alloc_bytes.count
(count)
Total number of bytes allocated even if freed
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.buck_hash.sys_bytes
(gauge)
Number of bytes used by the profiling bucket hash table
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.frees.count
(count)
Total number of frees
kubernetes_cluster_autoscaler.go.memstats.gc.sys_bytes
(gauge)
Number of bytes used for garbage collection system metadata
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.heap.alloc_bytes
(gauge)
Number of heap bytes allocated and still in use
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.heap.idle_bytes
(gauge)
Number of heap bytes waiting to be used
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.heap.inuse_bytes
(gauge)
Number of heap bytes that are in use
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.heap.objects
(gauge)
Number of allocated objects
Shown as object
kubernetes_cluster_autoscaler.go.memstats.heap.released_bytes
(gauge)
Number of heap bytes released to OS
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.heap.sys_bytes
(gauge)
Number of heap bytes obtained from system
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.lookups.count
(count)
Total number of pointer lookups
kubernetes_cluster_autoscaler.go.memstats.mallocs.count
(count)
Total number of mallocs
kubernetes_cluster_autoscaler.go.memstats.mcache.inuse_bytes
(gauge)
Number of bytes in use by mcache structures
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.mcache.sys_bytes
(gauge)
Number of bytes used for mcache structures obtained from system
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.mspan.inuse_bytes
(gauge)
Number of bytes in use by mspan structures
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.mspan.sys_bytes
(gauge)
Number of bytes used for mspan structures obtained from system
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.next.gc_bytes
(gauge)
Number of heap bytes when next garbage collection will take place
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.other.sys_bytes
(gauge)
Number of bytes used for other system allocations
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.stack.inuse_bytes
(gauge)
Number of bytes in use by the stack allocator
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.stack.sys_bytes
(gauge)
Number of bytes obtained from system for stack allocator
Shown as byte
kubernetes_cluster_autoscaler.go.memstats.sys_bytes
(gauge)
Number of bytes obtained from system
Shown as byte
kubernetes_cluster_autoscaler.go.threads
(gauge)
Number of OS threads created
Shown as thread
kubernetes_cluster_autoscaler.last.activity
(gauge)
Timestamp of the last activity in the cluster
kubernetes_cluster_autoscaler.max.nodes.count
(gauge)
Maximum number of nodes allowed in the cluster
kubernetes_cluster_autoscaler.memory.limits.bytes
(gauge)
Total memory limits set for pods in the cluster
kubernetes_cluster_autoscaler.nap.enabled
(gauge)
Indicates whether Node Auto-Provisioning (NAP) is enabled in the cluster
kubernetes_cluster_autoscaler.node.groups.count
(gauge)
Number of node groups in the cluster
kubernetes_cluster_autoscaler.nodes.count
(gauge)
Number of nodes in the cluster
kubernetes_cluster_autoscaler.old.unregistered.nodes.removed.count
(count)
Total count of old unregistered nodes removed from the cluster
kubernetes_cluster_autoscaler.scaled.down.gpu.nodes.count
(count)
Total count of GPU nodes scaled down in the cluster
kubernetes_cluster_autoscaler.scaled.down.nodes.count
(count)
Total count of nodes scaled down in the cluster
kubernetes_cluster_autoscaler.scaled.up.gpu.nodes.count
(count)
Total count of GPU nodes scaled up in the cluster
kubernetes_cluster_autoscaler.scaled.up.nodes.count
(count)
Total count of nodes scaled up in the cluster
kubernetes_cluster_autoscaler.skipped.scale.events.count
(count)
Total count of skipped scale events in the cluster
kubernetes_cluster_autoscaler.unneeded.nodes.count
(gauge)
Total count of unneeded nodes in the cluster
kubernetes_cluster_autoscaler.unschedulable.pods.count
(gauge)
Number of unschedulable pods in the cluster

Events

The Kubernetes Cluster Autoscaler integration does not include any events.

Service Checks

kubernetes_cluster_autoscaler.openmetrics.health
Returns CRITICAL if the Agent is unable to connect to the Kubernetes Cluster Autoscaler OpenMetrics endpoint, otherwise returns OK.
Statuses: ok, critical

Troubleshooting

Need help? Contact Datadog support.