Karpenter

Supported OS: Linux, Windows, Mac OS

Integration version: 1.2.0

Overview

This check monitors Karpenter through the Datadog Agent. For more information, see Karpenter monitoring.

Setup

Follow the instructions below to install and configure this check for an Agent running in your Kubernetes environment. For more information about configuration in containerized environments, see the Autodiscovery Integration Templates for guidance.

Installation

Starting from Agent release 7.50.0, the Karpenter check is included in the Datadog Agent package. No additional installation is needed in your environment.

This check uses OpenMetrics to collect metrics from the OpenMetrics endpoint that Karpenter exposes. The OpenMetrics-based check requires Python 3.

Configuration

Metric collection

Make sure that Prometheus-formatted metrics are exposed in your Karpenter cluster, and note the port on which they are exposed. You can configure the port by following the instructions on the Metrics page in the Karpenter documentation. For the Agent to start collecting metrics, the Karpenter pods need to be annotated. For more information about annotations, refer to the Autodiscovery Integration Templates. You can find additional configuration options by reviewing the sample karpenter.d/conf.yaml.

Note: The listed metrics can only be collected if they are available. Some metrics are generated only when certain actions are performed. For example, the karpenter.nodes.terminated.count metric is exposed only after a node is terminated.
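To see which metrics are currently exposed before annotating the pods, the Prometheus-formatted text from the endpoint can be scanned for metric names. The following is a minimal sketch, not part of the integration: the helper names, the sample payload, and the raw metric names in it (karpenter_nodes_created, karpenter_nodes_terminated) are illustrative assumptions.

```python
import re

# Matches the metric name at the start of a sample line,
# for example `karpenter_nodes_created{...} 3`.
_SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def exposed_metric_names(exposition_text):
    """Return the set of metric names found in Prometheus-formatted text."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and `# HELP` / `# TYPE` comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text, expected):
    """Return the expected raw metric names that are not exposed, sorted."""
    return sorted(set(expected) - exposed_metric_names(exposition_text))


# Hypothetical payload, shaped like the body served by the metrics endpoint.
SAMPLE = """\
# HELP karpenter_nodes_created Number of nodes created in total by Karpenter.
# TYPE karpenter_nodes_created counter
karpenter_nodes_created{provisioner="default"} 3
go_goroutines 42
"""

print(missing_metrics(SAMPLE, ["karpenter_nodes_created", "karpenter_nodes_terminated"]))
```

With this sample, the termination metric is reported as missing, which matches the case described in the note: no node has been terminated yet, so the metric has not been generated.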

The only parameter required for configuring the Karpenter check is:

  • openmetrics_endpoint: This parameter should be set to the location where the Prometheus-formatted metrics are exposed. The default port is 8000, but it can be configured using the METRICS_PORT environment variable. In containerized environments, %%host%% should be used for host autodetection.
apiVersion: v1
kind: Pod
# (...)
metadata:
  name: '<POD_NAME>'
  annotations:
    ad.datadoghq.com/controller.checks: |
      {
        "karpenter": {
          "init_config": {},
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:8000/metrics"
            }
          ]
        }
      }
    # (...)
spec:
  containers:
    - name: 'controller'
# (...)
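Because the annotation value must be valid JSON, it can help to generate it rather than write it by hand. The sketch below is one possible approach, assuming a hypothetical helper named karpenter_check_annotation and a port that matches the METRICS_PORT setting of the Karpenter deployment. It is not part of the Agent or of Karpenter.

```python
import json


def karpenter_check_annotation(port=8000):
    """Build the JSON value for the `ad.datadoghq.com/<container>.checks` annotation."""
    config = {
        "karpenter": {
            "init_config": {},
            "instances": [
                # %%host%% is left as-is; Autodiscovery resolves it at runtime.
                {"openmetrics_endpoint": f"http://%%host%%:{port}/metrics"}
            ],
        }
    }
    return json.dumps(config, indent=2)


annotation = karpenter_check_annotation(port=8080)
# json.loads raises ValueError on malformed JSON, such as a trailing comma.
parsed = json.loads(annotation)
print(parsed["karpenter"]["instances"][0]["openmetrics_endpoint"])
```

The printed endpoint is http://%%host%%:8080/metrics. The generated string can be pasted under the annotation key shown in the manifest above.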

Log collection

Available for Agent versions >6.0

Karpenter logs can be collected from the different Karpenter pods through Kubernetes. Collecting logs is disabled by default in the Datadog Agent. To enable it, see Kubernetes Log Collection.

See the Autodiscovery Integration Templates for guidance on applying the parameters below.

Parameter       Value
<LOG_CONFIG>    {"source": "karpenter", "service": "<SERVICE_NAME>"}
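In Autodiscovery pod annotations, the log configuration is supplied as a JSON list containing the <LOG_CONFIG> object. The sketch below shows one way to produce that value. The helper name and the service name karpenter-controller are hypothetical placeholders.

```python
import json


def karpenter_logs_annotation(service):
    """Return the JSON list used as the value of a `<container>.logs` annotation."""
    if not service:
        raise ValueError("service must be a non-empty string")
    # The source is fixed to "karpenter"; only the service name varies.
    return json.dumps([{"source": "karpenter", "service": service}])


print(karpenter_logs_annotation("karpenter-controller"))
```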

Validation

Run the Agent’s status subcommand and look for karpenter under the Checks section.

Data Collected

Metrics

karpenter.disruption.actions_performed.count
(count)
The count of disruption actions performed. Labeled by disruption method
Shown as execution
karpenter.disruption.consolidation_timeouts.count
(count)
The count of times the Consolidation algorithm has reached a timeout. Labeled by consolidation type
Shown as timeout
karpenter.disruption.eligible_nodes
(gauge)
Number of nodes eligible for disruption by Karpenter. Labeled by disruption method
Shown as node
karpenter.disruption.evaluation.duration_seconds.bucket
(count)
The count of observations in the disruption evaluation histogram by upper_bound buckets
karpenter.disruption.evaluation.duration_seconds.count
(count)
The count of observations in the disruption evaluation histogram
karpenter.disruption.evaluation.duration_seconds.sum
(count)
The sum of the duration of the disruption evaluation process in seconds
Shown as second
karpenter.disruption.replacement.nodeclaim.failures.count
(count)
The number of times that Karpenter failed to launch a replacement node for disruption. Labeled by disruption method
Shown as attempt
karpenter.disruption.replacement.nodeclaim.initialized_seconds.bucket
(count)
The count of observations in the replacement nodeclaim histogram by upper_bound buckets
karpenter.disruption.replacement.nodeclaim.initialized_seconds.count
(count)
The count of observations in the replacement nodeclaim histogram
karpenter.disruption.replacement.nodeclaim.initialized_seconds.sum
(count)
The sum of the amount of time required for a replacement nodeclaim to become initialized
Shown as second
karpenter.nodeclaims_created
(gauge)
Number of nodeclaims created in total by Karpenter. Labeled by reason the nodeclaim was created and the owning nodepool
karpenter.nodeclaims_disrupted
(gauge)
Number of nodeclaims disrupted in total by Karpenter. Labeled by disruption type of the nodeclaim and the owning nodepool
karpenter.nodeclaims_drifted
(gauge)
Number of nodeclaim drift reasons in total by Karpenter. Labeled by drift type of the nodeclaim and the owning nodepool
karpenter.nodeclaims_initialized
(gauge)
Number of nodeclaims initialized in total by Karpenter. Labeled by the owning nodepool
karpenter.nodeclaims_launched
(gauge)
Number of nodeclaims launched in total by Karpenter. Labeled by the owning nodepool
karpenter.nodeclaims_registered
(gauge)
Number of nodeclaims registered in total by Karpenter. Labeled by the owning nodepool
karpenter.nodeclaims_terminated
(gauge)
Number of nodeclaims terminated in total by Karpenter. Labeled by reason the nodeclaim was terminated and the owning nodepool
karpenter.nodepool_limit
(gauge)
The nodepool limits are the limits specified on the provisioner that restrict the quantity of resources provisioned. Labeled by nodepool name and resource type
karpenter.nodepool_usage
(gauge)
The nodepool usage is the amount of resources that have been provisioned by a particular nodepool. Labeled by nodepool name and resource type
karpenter.pods.state
(gauge)
Pod state is the current state of pods. This metric can be used in several ways, as it is labeled by the pod name, namespace, owner, node, provisioner name, zone, architecture, capacity type, instance type, and pod phase.
karpenter.certwatcher.read.certificate.count
(count)
The count of certificate reads
Shown as read
karpenter.certwatcher.read.certificate.errors.count
(count)
The count of certificate read errors
Shown as error
karpenter.cloudprovider.batcher.batch.time_seconds.bucket
(count)
The count of observations in the batching window histogram by upper_bound buckets
karpenter.cloudprovider.batcher.batch.time_seconds.count
(count)
The count of observations in the batching window histogram
karpenter.cloudprovider.batcher.batch.time_seconds.sum
(count)
The sum of the duration of the batching window per batcher
Shown as second
karpenter.cloudprovider.batcher.batch_size.bucket
(count)
The count of observations in the request batch histogram by upper_bound buckets
karpenter.cloudprovider.batcher.batch_size.count
(count)
The count of observations in the request batch histogram
karpenter.cloudprovider.batcher.batch_size.sum
(count)
The sum of the size of the request batch per batcher
karpenter.cloudprovider.duration_seconds.bucket
(count)
The count of observations in the duration of cloud provider histogram by upper_bound buckets, method name, and provider
karpenter.cloudprovider.duration_seconds.count
(count)
The count of observations in the duration of cloud provider histogram
karpenter.cloudprovider.duration_seconds.sum
(count)
The sum of the duration of cloud provider method calls. Labeled by the controller
Shown as second
karpenter.cloudprovider.errors.count
(count)
The count of errors returned from CloudProvider calls
Shown as error
karpenter.cloudprovider.instance.type.cpu_cores
(gauge)
vCPU cores for a given instance type
Shown as core
karpenter.cloudprovider.instance.type.memory_bytes
(gauge)
Memory, in bytes, for a given instance type
Shown as byte
karpenter.cloudprovider.instance.type.price_estimate
(gauge)
Estimated hourly price used when making informed decisions on node cost calculation. This is updated once on startup and then every 12 hours
karpenter.consistency.errors
(gauge)
Number of consistency checks that have failed
Shown as error
karpenter.controller.runtime.active_workers
(gauge)
Number of currently used workers per controller
Shown as worker
karpenter.controller.runtime.max.concurrent_reconciles
(gauge)
Maximum number of concurrent reconciles per controller
karpenter.controller.runtime.reconcile.count
(count)
The count of reconciliations per controller
karpenter.controller.runtime.reconcile.time_seconds.bucket
(count)
The count of observations in the reconciliation per controller histogram by upper_bound buckets
karpenter.controller.runtime.reconcile.time_seconds.count
(count)
The count of observations in the reconciliation per controller histogram
karpenter.controller.runtime.reconcile.time_seconds.sum
(count)
The sum of time per reconciliation per controller
Shown as second
karpenter.controller.runtime.reconcile_errors.count
(count)
The count of reconciliation errors per controller
Shown as error
karpenter.deprovisioning.actions_performed.count
(count)
The count of deprovisioning actions performed. Labeled by deprovisioner
Shown as execution
karpenter.deprovisioning.consolidation_timeouts
(gauge)
Number of times the Consolidation algorithm has reached a timeout. Labeled by consolidation type
Shown as timeout
karpenter.deprovisioning.eligible_machines
(gauge)
Number of machines eligible for deprovisioning by Karpenter. Labeled by deprovisioner
karpenter.deprovisioning.evaluation.duration_seconds.bucket
(count)
The count of observations in the deprovisioning evaluation histogram by upper_bound buckets
karpenter.deprovisioning.evaluation.duration_seconds.count
(count)
The count of observations in the deprovisioning evaluation histogram
karpenter.deprovisioning.evaluation.duration_seconds.sum
(count)
The sum of the duration of the deprovisioning evaluation process in seconds
Shown as second
karpenter.deprovisioning.replacement.machine.initialized_seconds.bucket
(count)
The count of observations in the replacement machine histogram by upper_bound buckets
karpenter.deprovisioning.replacement.machine.initialized_seconds.count
(count)
The count of observations in the replacement machine histogram
karpenter.deprovisioning.replacement.machine.initialized_seconds.sum
(count)
The sum of the time required for a replacement machine to become initialized
Shown as second
karpenter.deprovisioning.replacement.machine.launch.failure_counter.count
(count)
The count of times that Karpenter failed to launch a replacement node for deprovisioning. Labeled by deprovisioner
Shown as attempt
karpenter.go.gc.duration_seconds.count
(count)
The summary count of garbage collection cycles in the Karpenter instance
karpenter.go.gc.duration_seconds.quantile
(gauge)
The pause duration of garbage collection cycles in the Karpenter instance by quantile
karpenter.go.gc.duration_seconds.sum
(count)
The sum of the pause duration of garbage collection cycles in the Karpenter instance
Shown as second
karpenter.go.memstats.alloc_bytes
(gauge)
Number of bytes allocated and still in use
Shown as byte
karpenter.go.memstats.alloc_bytes.count
(count)
Count of bytes allocated, even if freed.
Shown as byte
karpenter.go.memstats.buck.hash.sys_bytes
(gauge)
Number of bytes used by the profiling bucket hash table
Shown as byte
karpenter.go.memstats.frees.count
(count)
The count of frees
karpenter.go.memstats.gc.sys_bytes
(gauge)
Number of bytes used for garbage collection system metadata
Shown as byte
karpenter.go.memstats.heap.alloc_bytes
(gauge)
Number of heap bytes allocated and still in use
Shown as byte
karpenter.go.memstats.heap.idle_bytes
(gauge)
Number of heap bytes waiting to be used
Shown as byte
karpenter.go.memstats.heap.inuse_bytes
(gauge)
Number of heap bytes that are in use
Shown as byte
karpenter.go.memstats.heap.objects
(gauge)
Number of allocated objects
Shown as object
karpenter.go.memstats.heap.released_bytes
(gauge)
Number of heap bytes released to OS
Shown as byte
karpenter.go.memstats.heap.sys_bytes
(gauge)
Number of heap bytes obtained from system
Shown as byte
karpenter.go.memstats.last.gc.time_seconds
(gauge)
Number of seconds since 1970 of last garbage collection
Shown as second
karpenter.go.memstats.lookups.count
(count)
The count of pointer lookups
karpenter.go.memstats.mallocs.count
(count)
The count of mallocs
karpenter.go.memstats.mcache.inuse_bytes
(gauge)
Number of bytes in use by mcache structures
Shown as byte
karpenter.go.memstats.mcache.sys_bytes
(gauge)
Number of bytes used for mcache structures obtained from system
Shown as byte
karpenter.go.memstats.mspan.inuse_bytes
(gauge)
Number of bytes in use by mspan structures
Shown as byte
karpenter.go.memstats.mspan.sys_bytes
(gauge)
Number of bytes used for mspan structures obtained from system
Shown as byte
karpenter.go.memstats.next.gc_bytes
(gauge)
Number of heap bytes when next garbage collection will take place
Shown as byte
karpenter.go.memstats.other.sys_bytes
(gauge)
Number of bytes used for other system allocations
Shown as byte
karpenter.go.memstats.stack.inuse_bytes
(gauge)
Number of bytes in use by the stack allocator
Shown as byte
karpenter.go.memstats.stack.sys_bytes
(gauge)
Number of bytes obtained from system for stack allocator
Shown as byte
karpenter.go.memstats.sys_bytes
(gauge)
Number of bytes obtained from system
Shown as byte
karpenter.go_goroutines
(gauge)
Number of goroutines that currently exist
karpenter.go_info
(gauge)
Information about the Go environment
karpenter.go_threads
(gauge)
Number of OS threads created
Shown as thread
karpenter.interruption.actions_performed.count
(count)
The count of notification actions performed. Labeled by action
Shown as execution
karpenter.interruption.deleted_messages.count
(count)
The count of messages deleted from the SQS queue
Shown as message
karpenter.interruption.message.latency.time_seconds.bucket
(count)
The count of observations in the interruption message latency histogram by upper_bound buckets
karpenter.interruption.message.latency.time_seconds.count
(count)
The count of observations in the interruption message latency histogram
karpenter.interruption.message.latency.time_seconds.sum
(count)
The sum of the length of time between message creation in queue and an action taken on the message by the controller
Shown as second
karpenter.interruption.received_messages.count
(count)
The count of messages received from the SQS queue. Broken down by message type and whether the message was actionable
Shown as message
karpenter.machines_created.count
(count)
The count of machines created in total by Karpenter. Labeled by reason the machine was created and the owning provisioner
karpenter.machines_disrupted.count
(count)
The count of machines disrupted in total by Karpenter. Labeled by disruption type of the machine and the owning provisioner
karpenter.machines_drifted.count
(count)
The count of machine drift reasons in total by Karpenter. Labeled by drift type of the machine and the owning provisioner
karpenter.machines_initialized.count
(count)
The count of machines initialized in total by Karpenter. Labeled by the owning provisioner
karpenter.machines_launched.count
(count)
The count of machines launched in total by Karpenter. Labeled by the owning provisioner
karpenter.machines_registered.count
(count)
The count of machines registered in total by Karpenter. Labeled by the owning provisioner
karpenter.machines_terminated.count
(count)
The count of machines terminated in total by Karpenter. Labeled by reason the machine was terminated and the owning provisioner
karpenter.nodes.allocatable
(gauge)
The amount of resources allocatable by nodes
karpenter.nodes.created.count
(count)
The count of nodes created in total by Karpenter. Labeled by owning provisioner
Shown as node
karpenter.nodes.leases_deleted.count
(count)
The count of deleted leaked leases
karpenter.nodes.system_overhead
(gauge)
The resources reserved for system overhead: the difference between the node's capacity and allocatable values, as reported by the node status
karpenter.nodes.terminated.count
(count)
The count of nodes terminated in total by Karpenter. Labeled by owning provisioner
Shown as node
karpenter.nodes.termination.time_seconds.count
(count)
The count of observations in the nodes termination time seconds summary
karpenter.nodes.termination.time_seconds.quantile
(gauge)
The time taken between a node's deletion request and the removal of its finalizer by quantile
karpenter.nodes.termination.time_seconds.sum
(count)
The sum of the time taken between a node's deletion request and the removal of its finalizer
Shown as second
karpenter.nodes.total.daemon_limits
(gauge)
Total resources specified by DaemonSet pod limits
karpenter.nodes.total.daemon_requests
(gauge)
Total resources requested by DaemonSet pods
karpenter.nodes.total.pod_limits
(gauge)
Total pod resources specified by non-DaemonSet pod limits
karpenter.nodes.total.pod_requests
(gauge)
Total pod resources requested by non-DaemonSet pods bound to nodes
karpenter.pods.startup.time_seconds.count
(count)
The count of the observations in the pod startup summary
karpenter.pods.startup.time_seconds.quantile
(gauge)
The time taken between pod creation and the pod being in a running state by quantile
karpenter.pods.startup.time_seconds.sum
(count)
The sum of the time between pod creation and the pod being in a running state
Shown as second
karpenter.process.cpu_seconds.count
(count)
Total user and system CPU time spent in seconds
Shown as second
karpenter.process.max_fds
(gauge)
Maximum number of open file descriptors
karpenter.process.open_fds
(gauge)
Number of open file descriptors
karpenter.process.resident.memory_bytes
(gauge)
Resident memory size in bytes
Shown as byte
karpenter.process.start.time_seconds
(gauge)
Start time of the process since unix epoch in seconds
Shown as second
karpenter.process.virtual.memory.max_bytes
(gauge)
Maximum amount of virtual memory available in bytes
Shown as byte
karpenter.process.virtual.memory_bytes
(gauge)
Virtual memory size in bytes
Shown as byte
karpenter.provisioner.limit
(gauge)
The limits specified on the provisioner that restrict the quantity of resources provisioned. Labeled by provisioner name and resource type
karpenter.provisioner.scheduling.duration_seconds.bucket
(count)
The count of observations in the provisioner scheduling histogram by upper_bound buckets
karpenter.provisioner.scheduling.duration_seconds.count
(count)
The count of observations in the provisioner scheduling histogram
karpenter.provisioner.scheduling.duration_seconds.sum
(count)
The sum of the duration of scheduling process in seconds. Broken down by provisioner and error
Shown as second
karpenter.provisioner.scheduling.simulation.duration_seconds.bucket
(count)
The count of observations in the provisioner scheduling simulation histogram by upper_bound buckets
karpenter.provisioner.scheduling.simulation.duration_seconds.count
(count)
The count of observations in the provisioner scheduling simulation histogram
karpenter.provisioner.scheduling.simulation.duration_seconds.sum
(count)
The sum of the duration of scheduling simulations used for deprovisioning and provisioning in seconds
Shown as second
karpenter.provisioner.usage
(gauge)
The amount of resources that have been provisioned by a particular provisioner. Labeled by provisioner name and resource type
karpenter.provisioner.usage.pct
(gauge)
The percentage of each resource used based on the resources provisioned and the limits that have been configured in the range [0,100]. Labeled by provisioner name and resource type
Shown as percent
karpenter.rest.client_requests.count
(count)
Count of HTTP requests, partitioned by status code, method, and host.
Shown as request
karpenter.workqueue.longest.running.processor_seconds
(gauge)
The number of seconds the longest-running processor for the workqueue has been running
Shown as second
karpenter.workqueue.queue.duration_seconds.bucket
(count)
The count of observations in the workqueue queue duration histogram by upper_bound buckets
karpenter.workqueue.queue.duration_seconds.count
(count)
The count of observations in the workqueue queue duration histogram
karpenter.workqueue.queue.duration_seconds.sum
(count)
The sum of the time, in seconds, that items stay in the workqueue before being requested
Shown as second
karpenter.workqueue.unfinished.work_seconds
(gauge)
The number of seconds of work that is in progress and hasn't been observed by work_duration. Large values indicate stuck threads. The number of stuck threads can be deduced from the rate at which this value increases
karpenter.workqueue.work.duration_seconds.bucket
(count)
The count of observations in the workqueue work duration histogram by upper_bound buckets
karpenter.workqueue.work.duration_seconds.count
(count)
The count of observations in the workqueue work duration histogram
karpenter.workqueue.work.duration_seconds.sum
(count)
The sum of the seconds spent processing items from the workqueue
Shown as second
karpenter.workqueue_adds.count
(count)
The count of adds handled by workqueue
karpenter.workqueue_depth
(gauge)
Current depth of workqueue
karpenter.workqueue_retries.count
(count)
The count of retries handled by workqueue
Shown as attempt
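
Many of the metrics above come in .sum and .count pairs (histograms and summaries). Dividing the change in .sum by the change in .count over the same interval gives the average duration per observation. The sketch below illustrates that arithmetic only. The function name and the sample values are hypothetical, and the inputs are assumed to be deltas already computed over one interval.

```python
def mean_duration_seconds(sum_delta, count_delta):
    """Average duration per observation over an interval.

    sum_delta:   increase of a `.sum` metric over the interval, in seconds
    count_delta: increase of the matching `.count` metric over the same interval
    Returns None when nothing was observed, to avoid dividing by zero.
    """
    if count_delta <= 0:
        return None
    return sum_delta / count_delta


# Hypothetical values: 50 reconciliations that took 12.5 seconds in total.
print(mean_duration_seconds(12.5, 50))
```

For these values the average is 0.25 seconds per observation. An interval with no observations yields None rather than a misleading zero.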

Events

The Karpenter integration does not include any events.

Service Checks

karpenter.openmetrics.health
Returns CRITICAL if the Agent is unable to connect to the Karpenter OpenMetrics endpoint; otherwise, returns OK.
Statuses: ok, critical

Troubleshooting

Need help? Contact Datadog support.