Kubernetes

Overview

Get metrics and logs from the Kubernetes service in real time to:

  • Visualize and monitor Kubernetes states.
  • Be notified about Kubernetes failovers and events.

For Kubernetes, it’s recommended to run the Agent in a DaemonSet. We have created a Docker image with both the Docker and the Kubernetes integrations enabled.

Alternatively, run the Datadog Agent directly on your host and configure it to gather your Kubernetes metrics.

Setup Kubernetes

Refer to the dedicated documentation to choose the right setup for your Kubernetes integration.

Setup Kubernetes State

Installation

To gather your kube-state metrics:

  1. Download the Kube-State manifests folder.

  2. Apply them to your Kubernetes cluster:

    kubectl apply -f <NAME_OF_THE_KUBE_STATE_MANIFESTS_FOLDER>
    

Setup Kubernetes DNS

Configuration

As of Agent v6, the Kubernetes DNS integration works automatically with Autodiscovery.

  • Note: these metrics are currently unavailable for Azure Kubernetes Service (AKS).

Collect container logs

Available for Agent >6.0

Two installations are possible:

  • On the node, where the Agent runs outside the Docker environment
  • In the Docker environment, using the containerized version of the Agent

Take advantage of DaemonSets to automatically deploy the Datadog Agent on all your nodes. Otherwise follow the container log collection steps to start collecting logs from all your containers.

Data Collected

Metrics

Kubernetes

kubernetes.cpu.capacity
(gauge)
The number of cores in this machine
shown as core
kubernetes.cpu.usage.total
(gauge)
The number of cores used
shown as nanocore
kubernetes.cpu.limits
(gauge)
The limit of cpu cores set
shown as core
kubernetes.cpu.requests
(gauge)
The requested cpu cores
shown as core
kubernetes.filesystem.usage
(gauge)
The amount of disk used
shown as byte
kubernetes.filesystem.usage_pct
(gauge)
The percentage of disk used
shown as fraction
kubernetes.memory.capacity
(gauge)
The amount of memory (in bytes) in this machine
shown as byte
kubernetes.memory.limits
(gauge)
The limit of memory set
shown as byte
kubernetes.memory.requests
(gauge)
The requested memory
shown as byte
kubernetes.memory.usage
(gauge)
The amount of memory used
shown as byte
kubernetes.network.rx_bytes
(gauge)
The amount of bytes per second received
shown as byte
kubernetes.network.tx_bytes
(gauge)
The amount of bytes per second transmitted
shown as byte
kubernetes.network_errors
(gauge)
The amount of network errors per second
shown as error
kubernetes.diskio.io_service_bytes.stats.total
(gauge)
The amount of disk space the container uses.
shown as byte

Kubelet

kubernetes.cpu.capacity
(gauge)
The number of cores in this machine
shown as core
kubernetes.cpu.usage.total
(gauge)
The number of cores used
shown as nanocore
kubernetes.cpu.limits
(gauge)
The limit of cpu cores set
shown as core
kubernetes.cpu.requests
(gauge)
The requested cpu cores
shown as core
kubernetes.filesystem.usage
(gauge)
The amount of disk used
shown as byte
kubernetes.filesystem.usage_pct
(gauge)
The percentage of disk used
shown as fraction
kubernetes.io.read_bytes
(gauge)
The amount of bytes read from the disk
shown as byte
kubernetes.io.write_bytes
(gauge)
The amount of bytes written to the disk
shown as byte
kubernetes.memory.capacity
(gauge)
The amount of memory (in bytes) in this machine
shown as byte
kubernetes.memory.limits
(gauge)
The limit of memory set
shown as byte
kubernetes.memory.requests
(gauge)
The requested memory
shown as byte
kubernetes.memory.usage
(gauge)
The amount of memory used
shown as byte
kubernetes.memory.usage_pct
(gauge)
The percentage of memory used
shown as fraction
kubernetes.network.rx_bytes
(gauge)
The amount of bytes per second received
shown as byte
kubernetes.network.rx_dropped
(gauge)
The amount of rx packets dropped per second
shown as packet
kubernetes.network.rx_errors
(gauge)
The amount of rx errors per second
shown as error
kubernetes.network.tx_bytes
(gauge)
The amount of bytes per second transmitted
shown as byte
kubernetes.network.tx_dropped
(gauge)
The amount of tx packets dropped per second
shown as packet
kubernetes.network.tx_errors
(gauge)
The amount of tx errors per second
shown as error
kubernetes.diskio.io_service_bytes.stats.total
(gauge)
The amount of disk space the container uses
shown as byte
kubernetes.apiserver.certificate.expiration.count
(gauge)
The count of remaining lifetime on the certificate used to authenticate a request
shown as second
kubernetes.apiserver.certificate.expiration.sum
(gauge)
The sum of remaining lifetime on the certificate used to authenticate a request
shown as second
kubernetes.rest.client.requests
(gauge)
The number of HTTP requests
shown as operation
kubernetes.kubelet.runtime.operations
(gauge)
The number of runtime operations
shown as operation
kubernetes.kubelet.runtime.errors
(gauge)
The number of runtime operations errors
shown as operation
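
The list above reports kubernetes.cpu.usage.total in nanocores while kubernetes.cpu.capacity is in cores, so the two need a unit conversion before they can be compared. The following is a minimal sketch of that arithmetic, not part of the integration itself; the function name and the sample values are hypothetical.

```python
# One core equals 10^9 nanocores.
NANOCORES_PER_CORE = 1_000_000_000


def cpu_usage_fraction(usage_nanocores, capacity_cores):
    """Return CPU usage as a fraction of capacity (hypothetical helper).

    usage_nanocores: a kubernetes.cpu.usage.total value (nanocores)
    capacity_cores:  a kubernetes.cpu.capacity value (cores)
    """
    if capacity_cores <= 0:
        raise ValueError("capacity_cores must be positive")
    usage_cores = usage_nanocores / NANOCORES_PER_CORE
    return usage_cores / capacity_cores


# Example with made-up values: 500,000,000 nanocores used on a 2-core node.
print(cpu_usage_fraction(500_000_000, 2))
```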

Kubernetes State

kubernetes_state.container.ready
(gauge)
Whether the container's readiness check succeeded
kubernetes_state.container.running
(gauge)
Whether the container is currently in running state
kubernetes_state.container.terminated
(gauge)
Whether the container is currently in terminated state
kubernetes_state.container.status_report.count.terminated
(count)
Count of the containers currently reporting a terminated state, with the reason as a tag
kubernetes_state.container.waiting
(gauge)
Whether the container is currently in waiting state
kubernetes_state.container.status_report.count.waiting
(count)
Count of the containers currently reporting a waiting state, with the reason as a tag
kubernetes_state.container.gpu.request
(gauge)
The number of requested gpu devices by a container
kubernetes_state.container.gpu.limit
(gauge)
The limit on gpu devices to be used by a container
kubernetes_state.container.restarts
(gauge)
The number of restarts per container
kubernetes_state.container.cpu_requested
(gauge)
The number of requested cpu cores by a container
shown as cpu
kubernetes_state.container.memory_requested
(gauge)
The number of requested memory bytes by a container
shown as byte
kubernetes_state.container.cpu_limit
(gauge)
The limit on cpu cores to be used by a container
shown as cpu
kubernetes_state.container.memory_limit
(gauge)
The limit on memory to be used by a container
shown as byte
kubernetes_state.daemonset.scheduled
(gauge)
The number of nodes that are running at least one daemon pod and are supposed to
kubernetes_state.daemonset.misscheduled
(gauge)
The number of nodes that are running a daemon pod but are not supposed to
kubernetes_state.daemonset.desired
(gauge)
The number of nodes that should be running the daemon pod
kubernetes_state.daemonset.ready
(gauge)
The number of nodes that should be running the daemon pod and have one or more running and ready
kubernetes_state.deployment.replicas
(gauge)
The number of replicas per deployment
kubernetes_state.deployment.replicas_available
(gauge)
The number of available replicas per deployment
kubernetes_state.deployment.replicas_unavailable
(gauge)
The number of unavailable replicas per deployment
kubernetes_state.deployment.replicas_updated
(gauge)
The number of updated replicas per deployment
kubernetes_state.deployment.replicas_desired
(gauge)
The number of desired replicas per deployment
kubernetes_state.deployment.paused
(gauge)
Whether a deployment is paused
kubernetes_state.deployment.rollingupdate.max_unavailable
(gauge)
Maximum number of unavailable replicas during a rolling update
kubernetes_state.job.status.failed
(counter)
Observed number of failed pods in a job
kubernetes_state.job.status.succeeded
(counter)
Observed number of succeeded pods in a job
kubernetes_state.limitrange.cpu.min
(gauge)
Minimum CPU request for this type
kubernetes_state.limitrange.cpu.max
(gauge)
Maximum CPU limit for this type
kubernetes_state.limitrange.cpu.default
(gauge)
Default CPU limit if not specified
kubernetes_state.limitrange.cpu.default_request
(gauge)
Default CPU request if not specified
kubernetes_state.limitrange.cpu.max_limit_request_ratio
(gauge)
Maximum CPU limit / request ratio
kubernetes_state.limitrange.memory.min
(gauge)
Minimum memory request for this type
kubernetes_state.limitrange.memory.max
(gauge)
Maximum memory limit for this type
kubernetes_state.limitrange.memory.default
(gauge)
Default memory limit if not specified
kubernetes_state.limitrange.memory.default_request
(gauge)
Default memory request if not specified
kubernetes_state.limitrange.memory.max_limit_request_ratio
(gauge)
Maximum memory limit / request ratio
kubernetes_state.node.cpu_capacity
(gauge)
The total CPU resources of the node
shown as cpu
kubernetes_state.node.memory_capacity
(gauge)
The total memory resources of the node
shown as byte
kubernetes_state.node.pods_capacity
(gauge)
The total pod resources of the node
kubernetes_state.node.gpu.cards_allocatable
(gauge)
The GPU resources of a node that are available for scheduling
kubernetes_state.node.gpu.cards_capacity
(gauge)
The total GPU resources of the node
kubernetes_state.persistentvolumeclaim.status
(gauge)
The phase the persistent volume claim is currently in
kubernetes_state.node.cpu_allocatable
(gauge)
The CPU resources of a node that are available for scheduling
shown as cpu
kubernetes_state.node.memory_allocatable
(gauge)
The memory resources of a node that are available for scheduling
shown as byte
kubernetes_state.node.pods_allocatable
(gauge)
The pod resources of a node that are available for scheduling
kubernetes_state.node.status
(gauge)
Submitted with a value of 1 for each node and tagged either 'status:schedulable' or 'status:unschedulable'. Sum this metric by status to get the number of nodes in that status.
kubernetes_state.nodes.by_condition
(gauge)
Sum this metric by `condition` and `status` to get the number of nodes in a given condition.
kubernetes_state.hpa.min_replicas
(gauge)
Lower limit for the number of pods that can be set by the autoscaler
kubernetes_state.hpa.max_replicas
(gauge)
Upper limit for the number of pods that can be set by the autoscaler
kubernetes_state.hpa.target_cpu
(gauge)
Target CPU percentage of pods managed by this autoscaler
kubernetes_state.hpa.desired_replicas
(gauge)
Desired number of replicas of pods managed by this autoscaler
kubernetes_state.pod.ready
(gauge)
In association with the `condition` tag, whether the pod is ready to serve requests, e.g. `condition:true` keeps the pods that are in a ready state
kubernetes_state.pod.scheduled
(gauge)
Reports the status of the scheduling process for the pod with its tags
kubernetes_state.replicaset.replicas
(gauge)
The number of replicas per ReplicaSet
kubernetes_state.replicaset.fully_labeled_replicas
(gauge)
The number of fully labeled replicas per ReplicaSet
kubernetes_state.replicaset.replicas_ready
(gauge)
The number of ready replicas per ReplicaSet
kubernetes_state.replicaset.replicas_desired
(gauge)
Number of desired pods for a ReplicaSet
kubernetes_state.replicationcontroller.replicas
(gauge)
The number of replicas per ReplicationController
kubernetes_state.replicationcontroller.fully_labeled_replicas
(gauge)
The number of fully labeled replicas per ReplicationController
kubernetes_state.replicationcontroller.replicas_ready
(gauge)
The number of ready replicas per ReplicationController
kubernetes_state.replicationcontroller.replicas_desired
(gauge)
Number of desired replicas for a ReplicationController
kubernetes_state.replicationcontroller.replicas_available
(gauge)
The number of available replicas per ReplicationController
kubernetes_state.resourcequota.pods.used
(gauge)
Observed number of pods used for a resource quota
kubernetes_state.resourcequota.services.used
(gauge)
Observed number of services used for a resource quota
kubernetes_state.resourcequota.persistentvolumeclaims.used
(gauge)
Observed number of persistent volume claims used for a resource quota
kubernetes_state.resourcequota.services.nodeports.used
(gauge)
Observed number of node ports used for a resource quota
kubernetes_state.resourcequota.services.loadbalancers.used
(gauge)
Observed number of loadbalancers used for a resource quota
kubernetes_state.resourcequota.requests.cpu.used
(gauge)
Observed sum of CPU cores requested for a resource quota
shown as cpu
kubernetes_state.resourcequota.requests.memory.used
(gauge)
Observed sum of memory bytes requested for a resource quota
shown as byte
kubernetes_state.resourcequota.requests.storage.used
(gauge)
Observed sum of storage bytes requested for a resource quota
shown as byte
kubernetes_state.resourcequota.limits.cpu.used
(gauge)
Observed sum of limits for CPU cores for a resource quota
shown as cpu
kubernetes_state.resourcequota.limits.memory.used
(gauge)
Observed sum of limits for memory bytes for a resource quota
shown as byte
kubernetes_state.resourcequota.pods.limit
(gauge)
Hard limit of the number of pods for a resource quota
kubernetes_state.resourcequota.services.limit
(gauge)
Hard limit of the number of services for a resource quota
kubernetes_state.resourcequota.persistentvolumeclaims.limit
(gauge)
Hard limit of the number of PVC for a resource quota
kubernetes_state.resourcequota.services.nodeports.limit
(gauge)
Hard limit of the number of node ports for a resource quota
kubernetes_state.resourcequota.services.loadbalancers.limit
(gauge)
Hard limit of the number of loadbalancers for a resource quota
kubernetes_state.resourcequota.requests.cpu.limit
(gauge)
Hard limit on the total of CPU cores requested for a resource quota
shown as cpu
kubernetes_state.resourcequota.requests.memory.limit
(gauge)
Hard limit on the total of memory bytes requested for a resource quota
shown as byte
kubernetes_state.resourcequota.requests.storage.limit
(gauge)
Hard limit on the total of storage bytes requested for a resource quota
shown as byte
kubernetes_state.resourcequota.limits.cpu.limit
(gauge)
Hard limit on the sum of CPU core limits for a resource quota
shown as cpu
kubernetes_state.resourcequota.limits.memory.limit
(gauge)
Hard limit on the sum of memory bytes limits for a resource quota
shown as byte
kubernetes_state.statefulset.replicas
(gauge)
The number of replicas per StatefulSet
kubernetes_state.statefulset.replicas_desired
(gauge)
The number of desired replicas per StatefulSet
kubernetes_state.statefulset.replicas_current
(gauge)
The number of current replicas per StatefulSet
kubernetes_state.statefulset.replicas_ready
(gauge)
The number of ready replicas per StatefulSet
kubernetes_state.statefulset.replicas_updated
(gauge)
The number of updated replicas per StatefulSet
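
The description of kubernetes_state.node.status says to sum the metric by status tag to count nodes. The sketch below illustrates that aggregation on a handful of made-up data points. The point layout (a dict with "value" and "tags") and the node names are assumptions for illustration only, not the Agent's actual payload format.

```python
from collections import defaultdict

# Hypothetical sample: one point per node, each with value 1 and a status tag.
points = [
    {"value": 1, "tags": ["node:node-a", "status:schedulable"]},
    {"value": 1, "tags": ["node:node-b", "status:schedulable"]},
    {"value": 1, "tags": ["node:node-c", "status:unschedulable"]},
]


def count_nodes_by_status(points):
    """Sum point values grouped by the value of their 'status:' tag."""
    totals = defaultdict(int)
    for point in points:
        for tag in point["tags"]:
            if tag.startswith("status:"):
                status = tag.split(":", 1)[1]
                totals[status] += point["value"]
    return dict(totals)


print(count_nodes_by_status(points))
```

Because every node submits exactly 1, the per-status sum equals the number of nodes in that status.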

Kubernetes DNS

kubedns.response_size.bytes.sum
(gauge)
Size of the returned response in bytes.
shown as byte
kubedns.response_size.bytes.count
(gauge)
Number of responses on which the kubedns.response_size.bytes.sum metric is evaluated.
shown as response
kubedns.request_duration.seconds.sum
(gauge)
Time (in seconds) each request took to resolve.
shown as second
kubedns.request_duration.seconds.count
(gauge)
Number of requests on which the kubedns.request_duration.seconds.sum metric is evaluated.
shown as request
kubedns.request_count
(gauge)
Total number of DNS requests made.
shown as request
kubedns.request_count.count
(count)
Instant number of DNS requests made.
shown as request
kubedns.error_count
(gauge)
Number of DNS requests resulting in an error.
shown as error
kubedns.error_count.count
(count)
Instant number of DNS requests made resulting in an error.
shown as error
kubedns.cachemiss_count
(gauge)
Number of DNS requests resulting in a cache miss.
shown as request
kubedns.cachemiss_count.count
(count)
Instant number of DNS requests made resulting in a cache miss.
shown as request
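
The .sum and .count pairs above are meant to be read together: dividing a sum by its matching count gives an average. The following sketch shows that calculation for kubedns.request_duration.seconds, assuming both values cover the same set of requests. The function name and sample numbers are hypothetical.

```python
def mean_request_duration(duration_sum, duration_count):
    """Average seconds per DNS request, or None when there were no requests.

    duration_sum:   a kubedns.request_duration.seconds.sum value
    duration_count: the matching kubedns.request_duration.seconds.count value
    """
    if duration_count == 0:
        return None
    return duration_sum / duration_count


# Example with made-up values: 2.0 seconds spent across 4 requests.
print(mean_request_duration(2.0, 4))
```

The same division applies to kubedns.response_size.bytes.sum and kubedns.response_size.bytes.count to get an average response size.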

Events

As of the 5.17.0 release, the Datadog Agent supports a built-in leader election option for the Kubernetes event collector. Once it is enabled, you no longer need to deploy an additional event collection container to your cluster. Instead, Agents coordinate to ensure that only one Agent instance gathers events at a given time. The following events are available:

  • Backoff
  • Conflict
  • Delete
  • DeletingAllPods
  • Didn’t have enough resource
  • Error
  • Failed
  • FailedCreate
  • FailedDelete
  • FailedMount
  • FailedSync
  • Failedvalidation
  • FreeDiskSpaceFailed
  • HostPortConflict
  • InsufficientFreeCPU
  • InsufficientFreeMemory
  • InvalidDiskCapacity
  • Killing
  • KubeletsetupFailed
  • NodeNotReady
  • NodeoutofDisk
  • OutofDisk
  • Rebooted
  • TerminatedAllPods
  • Unable
  • Unhealthy

Service Checks

The Kubernetes check includes the following service checks:

  • kubernetes.kubelet.check: If CRITICAL, either kubernetes.kubelet.check.ping or kubernetes.kubelet.check.syncloop is in CRITICAL or NO DATA state.

  • kubernetes.kubelet.check.ping: If CRITICAL or NO DATA, Kubelet’s API isn’t available.

  • kubernetes.kubelet.check.syncloop: If CRITICAL or NO DATA, Kubelet’s sync loop that updates containers isn’t working.
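
The relationship between the three service checks can be sketched as a small function. This is an illustration only: the text above states what a CRITICAL kubernetes.kubelet.check implies, and the sketch assumes the converse also holds (the overall check is OK when both sub-checks are OK). The status strings and the function name are hypothetical.

```python
OK = "OK"
CRITICAL = "CRITICAL"
NO_DATA = "NO DATA"


def kubelet_check_status(ping_status, syncloop_status):
    """Derive an overall status from the ping and syncloop sub-checks.

    Assumption: the overall check is CRITICAL when either sub-check is
    CRITICAL or NO DATA, and OK otherwise.
    """
    failing = {CRITICAL, NO_DATA}
    if ping_status in failing or syncloop_status in failing:
        return CRITICAL
    return OK


print(kubelet_check_status(OK, NO_DATA))
```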

Troubleshooting

Further Reading

To get a better idea of how (or why) to integrate your Kubernetes service, check out our series of blog posts about it.