Kubernetes State Metrics Core

Supported OS: Linux, macOS, Windows

Overview

Get metrics from the Kubernetes service in real time to:

  • Visualize and monitor Kubernetes states.
  • Be notified about Kubernetes failovers and events.

The Kubernetes State Metrics Core check leverages kube-state-metrics version 2+ and includes major performance and tagging improvements compared to the legacy kubernetes_state check.

As opposed to the legacy check, with the Kubernetes State Metrics Core check, you no longer need to deploy kube-state-metrics in your cluster.

Kubernetes State Metrics Core provides a better alternative to the legacy kubernetes_state check as it offers more granular metrics and tags. See the Major Changes and Data Collected for more details.

Setup

Installation

The Kubernetes State Metrics Core check is included in the Datadog Cluster Agent image, so you don’t need to install anything else on your Kubernetes servers.

Requirements

  • Datadog Cluster Agent v1.12+

Configuration

In your Helm values.yaml, add the following:

datadog:
  # (...)
  kubeStateMetricsCore:
    enabled: true

With the Datadog Operator, enable the kubernetes_state_core check by setting spec.features.kubeStateMetricsCore.enabled to true in the DatadogAgent resource:

kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  features:
    kubeStateMetricsCore:
      enabled: true
  global:
    credentials:
      apiKey: <DATADOG_API_KEY>
      appKey: <DATADOG_APP_KEY>

Note: Datadog Operator v0.7.0 or greater is required.

Migration from kubernetes_state to kubernetes_state_core

Tags removal

In the original kubernetes_state check, several tags have been flagged as deprecated and replaced by new tags. To determine your migration path, check which tags are submitted with your metrics.

In the kubernetes_state_core check, only the non-deprecated tags are submitted. Before migrating from kubernetes_state to kubernetes_state_core, verify that only official tags are used in monitors and dashboards.

Here is the mapping between deprecated tags and the official tags that have replaced them:

deprecated tag           official tag
cluster_name             kube_cluster_name
container                kube_container_name
cronjob                  kube_cronjob
daemonset                kube_daemon_set
deployment               kube_deployment
hpa                      horizontalpodautoscaler
image                    image_name
job                      kube_job
job_name                 kube_job
namespace                kube_namespace
phase                    pod_phase
pod                      pod_name
replicaset               kube_replica_set
replicationcontroller    kube_replication_controller
statefulset              kube_stateful_set

Backward incompatibility changes

The Kubernetes State Metrics Core check is not backward compatible. Read the changes below carefully before migrating from the legacy kubernetes_state check.

kubernetes_state.node.by_condition
A new metric with node name granularity. The legacy metric kubernetes_state.nodes.by_condition is deprecated in favor of this one. Note: This metric is backported into the Legacy check, where both metrics (it and the legacy metric it replaces) are available.
kubernetes_state.persistentvolume.by_phase
A new metric with persistentvolume name granularity. It replaces kubernetes_state.persistentvolumes.by_phase.
kubernetes_state.pod.status_phase
The metric is tagged with pod level tags, like pod_name.
kubernetes_state.node.count
The metric is no longer tagged with host. It aggregates the node count by kernel_version, os_image, container_runtime_version, and kubelet_version.
kubernetes_state.container.waiting and kubernetes_state.container.status_report.count.waiting
These metrics no longer emit a 0 value if no pods are waiting. They only report non-zero values.
kube_job
In kubernetes_state, the kube_job tag value is the CronJob name if the Job had CronJob as an owner, otherwise it is the Job name. In kubernetes_state_core, the kube_job tag value is always the Job name, and a new kube_cronjob tag key is added with the CronJob name as the tag value. When migrating to kubernetes_state_core, it’s recommended to use the new tag or kube_job:foo*, where foo is the CronJob name, for query filters.
kubernetes_state.job.succeeded
In kubernetes_state, kubernetes_state.job.succeeded was a count type. In kubernetes_state_core, it is a gauge type.

Enabling kubeStateMetricsCore in your Helm values.yaml configures the Agent to ignore the auto configuration file for the legacy kubernetes_state check, so that both checks do not run simultaneously.

If you still want to enable both checks simultaneously for the migration phase, disable the ignoreLegacyKSMCheck field in your values.yaml.

Note: ignoreLegacyKSMCheck makes the Agent only ignore the auto configuration for the legacy kubernetes_state check. Custom kubernetes_state configurations need to be removed manually.
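As a minimal sketch of the migration-phase setup described above, the following values keep the core check enabled while letting the legacy auto configuration load as well. It assumes that ignoreLegacyKSMCheck sits under datadog.kubeStateMetricsCore in the Datadog Helm chart; confirm the exact location against the values reference for your chart version.

```yaml
# Helm values.yaml: run both checks side by side during migration only.
datadog:
  # (...)
  kubeStateMetricsCore:
    enabled: true
    # Assumed location of this field in the chart. Setting it to false
    # stops the Agent from ignoring the legacy kubernetes_state auto
    # configuration, so both checks run.
    ignoreLegacyKSMCheck: false
```

Once the migration is complete, remove this override so that only the core check runs.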

Because the Kubernetes State Metrics Core check no longer requires kube-state-metrics in your cluster, you can disable deploying kube-state-metrics as part of the Datadog Helm Chart. To do this, add the following in your Helm values.yaml:

datadog:
  # (...)
  kubeStateMetricsEnabled: false

Important Note: The Kubernetes State Metrics Core check is an alternative to the legacy kubernetes_state check. Datadog recommends not enabling both checks simultaneously to guarantee consistent metrics.

Data Collected

Metrics

kubernetes_state.apiservice.count
Number of Kubernetes API services.
kubernetes_state.apiservice.condition
The condition of this API service. Tags:apiservice condition status.
kubernetes_state.configmap.count
Number of ConfigMaps. Tags:kube_namespace.
kubernetes_state.daemonset.count
Number of DaemonSets. Tags:kube_namespace.
kubernetes_state.daemonset.scheduled
The number of nodes running at least one daemon pod and are supposed to. Tags:kube_daemon_set kube_namespace (env service version from standard labels).
kubernetes_state.daemonset.desired
The number of nodes that should be running the daemon pod. Tags:kube_daemon_set kube_namespace (env service version from standard labels).
kubernetes_state.daemonset.misscheduled
The number of nodes running a daemon pod but are not supposed to. Tags:kube_daemon_set kube_namespace (env service version from standard labels).
kubernetes_state.daemonset.ready
The number of nodes that should be running the daemon pod and have one or more of the daemon pod running and ready. Tags:kube_daemon_set kube_namespace (env service version from standard labels).
kubernetes_state.daemonset.updated
The total number of nodes that are running updated daemon pod. Tags:kube_daemon_set kube_namespace (env service version from standard labels).
kubernetes_state.daemonset.daemons_unavailable
The number of nodes that should be running the daemon pod and have none of the daemon pod running and available. Tags:kube_daemon_set kube_namespace (env service version from standard labels).
kubernetes_state.daemonset.daemons_available
The number of nodes that should be running the daemon pod and have one or more of the daemon pod running and available. Tags:kube_daemon_set kube_namespace (env service version from standard labels).
kubernetes_state.deployment.count
Number of deployments. Tags:kube_namespace.
kubernetes_state.deployment.paused
Whether the deployment is paused and will not be processed by the deployment controller. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.deployment.replicas_desired
Number of desired pods for a deployment. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.deployment.rollingupdate.max_unavailable
Maximum number of unavailable replicas during a rolling update of a deployment. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.deployment.rollingupdate.max_surge
Maximum number of replicas that can be scheduled above the desired number of replicas during a rolling update of a deployment. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.deployment.replicas
The number of replicas per deployment. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.deployment.replicas_available
The number of available replicas per deployment. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.deployment.replicas_ready
The number of ready replicas per deployment. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.deployment.replicas_unavailable
The number of unavailable replicas per deployment. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.deployment.replicas_updated
The number of updated replicas per deployment. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.deployment.condition
The current status conditions of a deployment. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.endpoint.count
Number of endpoints. Tags:kube_namespace.
kubernetes_state.endpoint.address_available
Number of addresses available in endpoint. Tags:kube_endpoint kube_namespace.
kubernetes_state.endpoint.address_not_ready
Number of addresses not ready in endpoint. Tags:kube_endpoint kube_namespace.
kubernetes_state.namespace.count
Number of namespaces. Tags:phase.
kubernetes_state.node.count
Number of nodes. Tags:kernel_version os_image container_runtime_version kubelet_version.
kubernetes_state.node.cpu_allocatable
The allocatable CPU of a node that is available for scheduling. Tags:node resource unit.
kubernetes_state.node.memory_allocatable
The allocatable memory of a node that is available for scheduling. Tags:node resource unit.
kubernetes_state.node.pods_allocatable
The allocatable pods of a node that is available for scheduling. Tags:node resource unit.
kubernetes_state.node.ephemeral_storage_allocatable
The allocatable ephemeral-storage of a node that is available for scheduling. Tags:node resource unit.
kubernetes_state.node.network_bandwidth_allocatable
The allocatable network bandwidth of a node that is available for scheduling. Tags:node resource unit.
kubernetes_state.node.cpu_capacity
The CPU capacity of a node. Tags:node resource unit.
kubernetes_state.node.memory_capacity
The memory capacity of a node. Tags:node resource unit.
kubernetes_state.node.pods_capacity
The pods capacity of a node. Tags:node resource unit.
kubernetes_state.node.ephemeral_storage_capacity
The ephemeral-storage capacity of a node. Tags:node resource unit.
kubernetes_state.node.network_bandwidth_capacity
The network bandwidth capacity of a node. Tags:node resource unit.
kubernetes_state.node.by_condition
The condition of a cluster node. Tags:condition node status.
kubernetes_state.node.status
Whether the node can schedule new pods. Tags:node status.
kubernetes_state.node.age
The time in seconds since the creation of the node. Tags:node.
kubernetes_state.container.terminated
Describes whether the container is currently in a terminated state. Tags:kube_namespace pod_name kube_container_name (env service version from standard labels).
kubernetes_state.container.cpu_limit
The value of CPU limit by a container. Tags:kube_namespace pod_name kube_container_name node resource unit (env service version from standard labels).
kubernetes_state.container.memory_limit
The value of memory limit by a container. Tags:kube_namespace pod_name kube_container_name node resource unit (env service version from standard labels).
kubernetes_state.container.network_bandwidth_limit
The value of network bandwidth limit by a container. Tags:kube_namespace pod_name kube_container_name node resource unit (env service version from standard labels).
kubernetes_state.container.cpu_requested
The value of CPU requested by a container. Tags:kube_namespace pod_name kube_container_name node resource unit (env service version from standard labels).
kubernetes_state.container.memory_requested
The value of memory requested by a container. Tags:kube_namespace pod_name kube_container_name node resource unit (env service version from standard labels).
kubernetes_state.container.network_bandwidth_requested
The value of network bandwidth requested by a container. Tags:kube_namespace pod_name kube_container_name node resource unit (env service version from standard labels).
kubernetes_state.container.ready
Describes whether the container's readiness check succeeded. Tags:kube_namespace pod_name kube_container_name (env service version from standard labels).
kubernetes_state.container.restarts
The number of container restarts per container. Tags:kube_namespace pod_name kube_container_name (env service version from standard labels).
kubernetes_state.container.running
Describes whether the container is currently in a running state. Tags:kube_namespace pod_name kube_container_name (env service version from standard labels).
kubernetes_state.container.waiting
Describes whether the container is currently in a waiting state. Tags:kube_namespace pod_name kube_container_name (env service version from standard labels).
kubernetes_state.container.status_report.count.waiting
Describes the reason the container is currently in a waiting state. Tags:kube_namespace pod_name kube_container_name reason (env service version from standard labels).
kubernetes_state.container.status_report.count.terminated
Describes the reason the container is currently in a terminated state. Tags:kube_namespace pod_name kube_container_name reason (env service version from standard labels).
kubernetes_state.crd.count
Number of custom resource definitions.
kubernetes_state.crd.condition
The condition of this custom resource definition. Tags:customresourcedefinition condition status.
kubernetes_state.pod.ready
Describes whether the pod is ready to serve requests. Tags:node kube_namespace pod_name condition (env service version from standard labels).
kubernetes_state.pod.scheduled
Describes the status of the scheduling process for the pod. Tags:node kube_namespace pod_name condition (env service version from standard labels).
kubernetes_state.pod.volumes.persistentvolumeclaims_readonly
Describes whether a persistentvolumeclaim is mounted read only. Tags:node kube_namespace pod_name volume persistentvolumeclaim (env service version from standard labels).
kubernetes_state.pod.unschedulable
Describes the unschedulable status for the pod. Tags:kube_namespace pod_name (env service version from standard labels).
kubernetes_state.pod.status_phase
The pod's current phase. Tags:node kube_namespace pod_name pod_phase (env service version from standard labels).
kubernetes_state.pod.age
The time in seconds since the creation of the pod. Tags:node kube_namespace pod_name pod_phase (env service version from standard labels).
kubernetes_state.pod.uptime
The time in seconds since the pod has been scheduled and acknowledged by the Kubelet. Tags:node kube_namespace pod_name pod_phase (env service version from standard labels).
kubernetes_state.pod.count
Number of Pods. Tags:node kube_namespace kube_<owner kind>.
kubernetes_state.persistentvolumeclaim.status
The phase the persistent volume claim is currently in. Tags:kube_namespace persistentvolumeclaim phase storageclass.
kubernetes_state.persistentvolumeclaim.access_mode
The access mode(s) specified by the persistent volume claim. Tags:kube_namespace persistentvolumeclaim access_mode storageclass.
kubernetes_state.persistentvolumeclaim.request_storage
The capacity of storage requested by the persistent volume claim. Tags:kube_namespace persistentvolumeclaim storageclass.
kubernetes_state.persistentvolume.capacity
Persistentvolume capacity in bytes. Tags:persistentvolume storageclass.
kubernetes_state.persistentvolume.by_phase
The phase indicates if a volume is available, bound to a claim, or released by a claim. Tags:persistentvolume storageclass phase.
kubernetes_state.pdb.pods_healthy
Current number of healthy pods. Tags:kube_namespace poddisruptionbudget.
kubernetes_state.pdb.pods_desired
Minimum desired number of healthy pods. Tags:kube_namespace poddisruptionbudget.
kubernetes_state.pdb.disruptions_allowed
Number of pod disruptions that are currently allowed. Tags:kube_namespace poddisruptionbudget.
kubernetes_state.pdb.pods_total
Total number of pods counted by this disruption budget. Tags:kube_namespace poddisruptionbudget.
kubernetes_state.secret.count
Number of Secrets. Tags:kube_namespace.
kubernetes_state.secret.type
The type of the secret. Tags:kube_namespace secret type.
kubernetes_state.replicaset.count
Number of ReplicaSets. Tags:kube_namespace kube_deployment.
kubernetes_state.replicaset.replicas_desired
Number of desired pods for a ReplicaSet. Tags:kube_namespace kube_replica_set (env service version from standard labels).
kubernetes_state.replicaset.fully_labeled_replicas
The number of fully labeled replicas per ReplicaSet. Tags:kube_namespace kube_replica_set (env service version from standard labels).
kubernetes_state.replicaset.replicas_ready
The number of ready replicas per ReplicaSet. Tags:kube_namespace kube_replica_set (env service version from standard labels).
kubernetes_state.replicaset.replicas
The number of replicas per ReplicaSet. Tags:kube_namespace kube_replica_set (env service version from standard labels).
kubernetes_state.replicationcontroller.replicas_desired
Number of desired pods for a ReplicationController. Tags:kube_namespace kube_replication_controller.
kubernetes_state.replicationcontroller.replicas_available
The number of available replicas per ReplicationController. Tags:kube_namespace kube_replication_controller.
kubernetes_state.replicationcontroller.fully_labeled_replicas
The number of fully labeled replicas per ReplicationController. Tags:kube_namespace kube_replication_controller.
kubernetes_state.replicationcontroller.replicas_ready
The number of ready replicas per ReplicationController. Tags:kube_namespace kube_replication_controller.
kubernetes_state.replicationcontroller.replicas
The number of replicas per ReplicationController. Tags:kube_namespace kube_replication_controller.
kubernetes_state.statefulset.count
Number of StatefulSets. Tags:kube_namespace.
kubernetes_state.statefulset.replicas_desired
Number of desired pods for a StatefulSet. Tags:kube_namespace kube_stateful_set (env service version from standard labels).
kubernetes_state.statefulset.replicas
The number of replicas per StatefulSet. Tags:kube_namespace kube_stateful_set (env service version from standard labels).
kubernetes_state.statefulset.replicas_current
The number of current replicas per StatefulSet. Tags:kube_namespace kube_stateful_set (env service version from standard labels).
kubernetes_state.statefulset.replicas_ready
The number of ready replicas per StatefulSet. Tags:kube_namespace kube_stateful_set (env service version from standard labels).
kubernetes_state.statefulset.replicas_updated
The number of updated replicas per StatefulSet. Tags:kube_namespace kube_stateful_set (env service version from standard labels).
kubernetes_state.hpa.count
Number of horizontal pod autoscalers. Tags: kube_namespace.
kubernetes_state.hpa.min_replicas
Lower limit for the number of pods that can be set by the autoscaler, default 1. Tags:kube_namespace horizontalpodautoscaler.
kubernetes_state.hpa.max_replicas
Upper limit for the number of pods that can be set by the autoscaler; cannot be smaller than MinReplicas. Tags:kube_namespace horizontalpodautoscaler.
kubernetes_state.hpa.condition
The condition of this autoscaler. Tags:kube_namespace horizontalpodautoscaler condition status.
kubernetes_state.hpa.desired_replicas
Desired number of replicas of pods managed by this autoscaler. Tags:kube_namespace horizontalpodautoscaler.
kubernetes_state.hpa.current_replicas
Current number of replicas of pods managed by this autoscaler. Tags:kube_namespace horizontalpodautoscaler.
kubernetes_state.hpa.spec_target_metric
The metric specifications used by this autoscaler when calculating the desired replica count. Tags:kube_namespace horizontalpodautoscaler metric_name metric_target_type.
kubernetes_state.hpa.status_target_metric
The current metric status used by this autoscaler when calculating the desired replica count. Tags:kube_namespace horizontalpodautoscaler metric_name metric_target_type.
kubernetes_state.vpa.count
Number of vertical pod autoscalers. Tags: kube_namespace.
kubernetes_state.vpa.lower_bound
Minimum resources the container can use before the VerticalPodAutoscaler updater evicts it. Tags:kube_namespace verticalpodautoscaler kube_container_name resource target_api_version target_kind target_name unit.
kubernetes_state.vpa.target
Target resources the VerticalPodAutoscaler recommends for the container. Tags:kube_namespace verticalpodautoscaler kube_container_name resource target_api_version target_kind target_name unit.
kubernetes_state.vpa.uncapped_target
Target resources the VerticalPodAutoscaler recommends for the container ignoring bounds. Tags:kube_namespace verticalpodautoscaler kube_container_name resource target_api_version target_kind target_name unit.
kubernetes_state.vpa.upperbound
Maximum resources the container can use before the VerticalPodAutoscaler updater evicts it. Tags:kube_namespace verticalpodautoscaler kube_container_name resource target_api_version target_kind target_name unit.
kubernetes_state.vpa.update_mode
Update mode of the VerticalPodAutoscaler. Tags:kube_namespace verticalpodautoscaler target_api_version target_kind target_name update_mode.
kubernetes_state.vpa.spec_container_minallowed
Minimum resources the VerticalPodAutoscaler can set for containers matching the name. Tags:kube_namespace verticalpodautoscaler kube_container_name resource target_api_version target_kind target_name unit.
kubernetes_state.vpa.spec_container_maxallowed
Maximum resources the VerticalPodAutoscaler can set for containers matching the name. Tags:kube_namespace verticalpodautoscaler kube_container_name resource target_api_version target_kind target_name unit.
kubernetes_state.cronjob.count
Number of cronjobs. Tags:kube_namespace.
kubernetes_state.cronjob.spec_suspend
Suspend flag tells the controller to suspend subsequent executions. Tags:kube_namespace kube_cronjob (env service version from standard labels).
kubernetes_state.cronjob.duration_since_last_schedule
The duration since the last time the cronjob was scheduled. Tags:kube_cronjob kube_namespace (env service version from standard labels).
kubernetes_state.job.count
Number of jobs. Tags:kube_namespace kube_cronjob.
kubernetes_state.job.failed
The number of pods which reached Phase Failed. Tags:kube_job or kube_cronjob kube_namespace (env service version from standard labels).
kubernetes_state.job.succeeded
The number of pods which reached Phase Succeeded. Tags:kube_job or kube_cronjob kube_namespace (env service version from standard labels).
kubernetes_state.job.completion.succeeded
The job has completed its execution. Tags:kube_job or kube_cronjob kube_namespace (env service version from standard labels).
kubernetes_state.job.completion.failed
The job has failed its execution. Tags:kube_job or kube_cronjob kube_namespace (env service version from standard labels).
kubernetes_state.job.duration
Time elapsed between the start and completion time of the job, or the current time if the job is still running. Tags:kube_job kube_namespace (env service version from standard labels).
kubernetes_state.resourcequota.<resource>.limit
Information about resource quota limits by resource. Tags:kube_namespace resourcequota.
kubernetes_state.resourcequota.<resource>.used
Information about resource quota usage by resource. Tags:kube_namespace resourcequota.
kubernetes_state.limitrange.cpu.min
Information about CPU limit range usage by constraint. Tags:kube_namespace limitrange type.
kubernetes_state.limitrange.cpu.max
Information about CPU limit range usage by constraint. Tags:kube_namespace limitrange type.
kubernetes_state.limitrange.cpu.default
Information about CPU limit range usage by constraint. Tags:kube_namespace limitrange type.
kubernetes_state.limitrange.cpu.default_request
Information about CPU limit range usage by constraint. Tags:kube_namespace limitrange type.
kubernetes_state.limitrange.cpu.max_limit_request_ratio
Information about CPU limit range usage by constraint. Tags:kube_namespace limitrange type.
kubernetes_state.limitrange.memory.min
Information about memory limit range usage by constraint. Tags:kube_namespace limitrange type.
kubernetes_state.limitrange.memory.max
Information about memory limit range usage by constraint. Tags:kube_namespace limitrange type.
kubernetes_state.limitrange.memory.default
Information about memory limit range usage by constraint. Tags:kube_namespace limitrange type.
kubernetes_state.limitrange.memory.default_request
Information about memory limit range usage by constraint. Tags:kube_namespace limitrange type.
kubernetes_state.limitrange.memory.max_limit_request_ratio
Information about memory limit range usage by constraint. Tags:kube_namespace limitrange type.
kubernetes_state.service.count
Number of services. Tags:kube_namespace type.
kubernetes_state.service.type
Service types. Tags:kube_namespace kube_service type.
kubernetes_state.ingress.count
Number of ingresses. Tags:kube_namespace.
kubernetes_state.ingress.path
Information about the ingress path. Tags:kube_namespace kube_ingress_path kube_ingress kube_service kube_service_port kube_ingress_host.

Note: You can configure Datadog Standard labels on your Kubernetes objects to get the env service version tags.
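To illustrate the note above, here is a sketch of a Deployment carrying the standard labels. It assumes the standard labels are the tags.datadoghq.com/env, tags.datadoghq.com/service, and tags.datadoghq.com/version labels from Datadog unified service tagging. The workload name, image, and label values are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-store                      # hypothetical workload name
  labels:
    # Labels on the Deployment itself feed deployment-level metrics.
    tags.datadoghq.com/env: "prod"
    tags.datadoghq.com/service: "web-store"
    tags.datadoghq.com/version: "1.2.3"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web-store
  template:
    metadata:
      labels:
        app: web-store
        # Labels on the pod template feed pod- and container-level metrics.
        tags.datadoghq.com/env: "prod"
        tags.datadoghq.com/service: "web-store"
        tags.datadoghq.com/version: "1.2.3"
    spec:
      containers:
        - name: web-store
          image: registry.example.com/web-store:1.2.3   # hypothetical image
```

The labels are set on both the object and its pod template because the env service version tags are read from the labels of the object each metric describes.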

Events

The Kubernetes State Metrics Core check does not include any events.

Service Checks

kubernetes_state.cronjob.complete
Whether the last job of the cronjob failed. Tags:kube_cronjob kube_namespace (env service version from standard labels).
kubernetes_state.cronjob.on_schedule_check
Alert if the cronjob’s next schedule is in the past. Tags:kube_cronjob kube_namespace (env service version from standard labels).
kubernetes_state.job.complete
Whether the job failed. Tags:kube_job or kube_cronjob kube_namespace (env service version from standard labels).
kubernetes_state.node.ready
Whether the node is ready. Tags:node condition status.
kubernetes_state.node.out_of_disk
Whether the node is out of disk. Tags:node condition status.
kubernetes_state.node.disk_pressure
Whether the node is under disk pressure. Tags:node condition status.
kubernetes_state.node.network_unavailable
Whether the node network is unavailable. Tags:node condition status.
kubernetes_state.node.memory_pressure
Whether the node is under memory pressure. Tags:node condition status.

Validation

Run the Cluster Agent’s status subcommand inside your Cluster Agent container and look for kubernetes_state_core under the Checks section.

Troubleshooting

Timeout errors

By default, the Kubernetes State Metrics Core check waits 10 seconds for a response from the Kubernetes API server. For large clusters, the request may time out, resulting in missing metrics.

You can avoid this by setting the environment variable DD_KUBERNETES_APISERVER_CLIENT_TIMEOUT to a higher value than the default 10 seconds.

If you use the Datadog Operator, update your datadog-agent.yaml with the following configuration:

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  override:
    clusterAgent:
      env:
        - name: DD_KUBERNETES_APISERVER_CLIENT_TIMEOUT
          value: <value_greater_than_10>

Then apply the new configuration:

kubectl apply -n $DD_NAMESPACE -f datadog-agent.yaml

If you use Helm, update your datadog-values.yaml with the following configuration:

clusterAgent:
  env:
    - name: DD_KUBERNETES_APISERVER_CLIENT_TIMEOUT
      value: <value_greater_than_10>

Then upgrade your Helm chart:

helm upgrade -f datadog-values.yaml <RELEASE_NAME> datadog/datadog
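As a concrete sketch of the placeholder used above, the Helm values below set the timeout to 30 seconds. The value 30 is only an illustration, not a recommendation. It is quoted because Kubernetes expects environment variable values to be strings.

```yaml
# datadog-values.yaml
clusterAgent:
  env:
    - name: DD_KUBERNETES_APISERVER_CLIENT_TIMEOUT
      # Illustrative value in seconds (default is 10). Quoted so that
      # YAML parses it as a string rather than an integer.
      value: "30"
```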

Need help? Contact Datadog support.
