---
title: Kubernetes
description: >-
  Capture Pod scheduling events, track the status of your Kubelets, and much
  more.
breadcrumbs: Docs > Integrations > Kubernetes
---

# Kubernetes
**Integration version:** 1.7.0


## Overview{% #overview %}

Get metrics from the Kubernetes service in real time to:

- Visualize and monitor Kubernetes states.
- Be notified about Kubernetes failovers and events.

Note: This check only works with Agent v5. For Agent v6+, see the [kubelet check](https://docs.datadoghq.com/integrations/kubelet.md).

**Minimum Agent version:** 6.0.0

## Setup{% #setup %}

### Installation{% #installation %}

The Kubernetes check is included in the [Datadog Agent](https://app.datadoghq.com/account/settings/agent/latest) package, so you don't need to install anything else on your Kubernetes servers.

For more information on installing the Datadog Agent on your Kubernetes clusters, see the [Kubernetes documentation](https://docs.datadoghq.com/agent/kubernetes.md).

To collect Kubernetes State metrics, see the [kubernetes_state integration](https://docs.datadoghq.com/integrations/kubernetes.md#kubernetes-state-metrics).

### Configuration{% #configuration %}

Edit the `kubernetes.yaml` file to point to your server and port, and to set the masters to monitor.
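A minimal sketch of what that file might look like. The host and port values below are illustrative placeholders, not defaults confirmed by this page; adjust them to match where your kubelet/cAdvisor endpoint is reachable:

```yaml
init_config:

instances:
    # Illustrative values - point these at your own node and port.
  - host: localhost
    port: 4194
```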

### Validation{% #validation %}

Run the [Agent's status subcommand](https://docs.datadoghq.com/agent/guide/agent-commands.md#agent-status-and-information) and look for `kubernetes` under the Checks section.

## Data Collected{% #data-collected %}

### Metrics{% #metrics %}

| Metric | Description |
| --- | --- |
| **kubernetes.cpu.capacity**(gauge)                          | The number of cores in this machine*Shown as core*                                 |
| **kubernetes.cpu.limits**(gauge)                            | The limit of cpu cores set*Shown as core*                                          |
| **kubernetes.cpu.requests**(gauge)                          | The requested cpu cores*Shown as core*                                             |
| **kubernetes.cpu.usage.total**(gauge)                       | The number of cores used*Shown as nanocore*                                        |
| **kubernetes.diskio.io\_service\_bytes.stats.total**(gauge) | The amount of disk space the container uses.*Shown as byte*                        |
| **kubernetes.filesystem.usage**(gauge)                      | The amount of disk used. Requires Docker container runtime.*Shown as byte*         |
| **kubernetes.filesystem.usage\_pct**(gauge)                 | The percentage of disk used. Requires Docker container runtime.*Shown as fraction* |
| **kubernetes.memory.capacity**(gauge)                       | The amount of memory (in bytes) in this machine*Shown as byte*                     |
| **kubernetes.memory.limits**(gauge)                         | The limit of memory set*Shown as byte*                                             |
| **kubernetes.memory.requests**(gauge)                       | The requested memory*Shown as byte*                                                |
| **kubernetes.memory.usage**(gauge)                          | The amount of memory used*Shown as byte*                                           |
| **kubernetes.network.rx\_bytes**(gauge)                     | The amount of bytes per second received*Shown as byte*                             |
| **kubernetes.network.tx\_bytes**(gauge)                     | The amount of bytes per second transmitted*Shown as byte*                          |
| **kubernetes.network\_errors**(gauge)                       | The amount of network errors per second*Shown as error*                            |
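Note that `kubernetes.cpu.usage.total` is reported in nanocores while `kubernetes.cpu.limits` and `kubernetes.cpu.requests` are reported in cores. A small sketch of the unit conversion needed to compare them (the function names are illustrative, not part of the integration):

```python
# kubernetes.cpu.usage.total is in nanocores; limits/requests are in cores.
NANOCORES_PER_CORE = 1_000_000_000

def nanocores_to_cores(usage_nanocores: float) -> float:
    """Convert a nanocore reading to cores."""
    return usage_nanocores / NANOCORES_PER_CORE

def cpu_utilization_pct(usage_nanocores: float, limit_cores: float) -> float:
    """Percentage of the CPU limit consumed, given a nanocore usage reading."""
    return 100.0 * nanocores_to_cores(usage_nanocores) / limit_cores

# Example: 250,000,000 nanocores against a 0.5-core limit
print(cpu_utilization_pct(250_000_000, 0.5))  # 50.0
```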

### Events{% #events %}

As of the v5.17.0 release, the Datadog Agent supports a built-in leader election option for the Kubernetes event collector. Once enabled, you no longer need to deploy an additional event collection container to your cluster. Instead, Agents coordinate to ensure only one Agent instance is gathering events at a given time. The following events are collected:

- Backoff
- Conflict
- Delete
- DeletingAllPods
- Didn't have enough resource
- Error
- Failed
- FailedCreate
- FailedDelete
- FailedMount
- FailedSync
- Failedvalidation
- FreeDiskSpaceFailed
- HostPortConflict
- InsufficientFreeCPU
- InsufficientFreeMemory
- InvalidDiskCapacity
- Killing
- KubeletsetupFailed
- NodeNotReady
- NodeoutofDisk
- OutofDisk
- Rebooted
- TerminatedAllPods
- Unable
- Unhealthy

### Service Checks{% #service-checks %}

**kubernetes\_state.node.ready**

Returns `CRITICAL` if a cluster node is not ready. Returns `WARNING` if status is unknown. Returns `OK` otherwise.

*Statuses: ok, warning, critical*

**kubernetes\_state.node.out\_of\_disk**

Returns `CRITICAL` if a cluster node is out of disk space. Returns `UNKNOWN` if status is unknown. Returns `OK` otherwise.

*Statuses: ok, unknown, critical*

**kubernetes\_state.node.disk\_pressure**

Returns `CRITICAL` if a cluster node is in a disk pressure state. Returns `UNKNOWN` if status is unknown. Returns `OK` otherwise.

*Statuses: ok, unknown, critical*

**kubernetes\_state.node.memory\_pressure**

Returns `CRITICAL` if a cluster node is in a memory pressure state. Returns `UNKNOWN` if status is unknown. Returns `OK` otherwise.

*Statuses: ok, unknown, critical*

**kubernetes\_state.node.network\_unavailable**

Returns `CRITICAL` if a cluster node is in a network unavailable state. Returns `UNKNOWN` if status is unknown. Returns `OK` otherwise.

*Statuses: ok, unknown, critical*

**kubernetes\_state.cronjob.on\_schedule\_check**

Returns `CRITICAL` if a cron job scheduled time is unknown or in the past. Returns `OK` otherwise.

*Statuses: ok, critical*

**kubernetes\_state.job.complete**

Returns `CRITICAL` if a job failed. Returns `OK` otherwise.

*Statuses: ok, critical*

**kubernetes\_state.cronjob.complete**

Returns `CRITICAL` if the last job of a cronjob failed. Returns `OK` otherwise.

*Statuses: ok, critical*

## Troubleshooting{% #troubleshooting %}

### Agent installation on Kubernetes master nodes{% #agent-installation-on-kubernetes-master-nodes %}

Since Kubernetes v1.6, the concept of [Taints and tolerations](https://blog.kubernetes.io/2017/03/advanced-scheduling-in-kubernetes.html) was introduced. The master node is no longer off limits; it's simply tainted. Add the required toleration to the pod to run it.

Add the following lines to your Deployment (or Daemonset if you are running a multi-master setup):

```yaml
spec:
  tolerations:
    - key: node-role.kubernetes.io/master
      effect: NoSchedule
```
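On newer Kubernetes versions, the control-plane taint key was renamed from `node-role.kubernetes.io/master` to `node-role.kubernetes.io/control-plane` (the old key was eventually removed). If your cluster may use either key, a sketch that tolerates both:

```yaml
spec:
  tolerations:
    # Older clusters taint control-plane nodes with this key.
    - key: node-role.kubernetes.io/master
      effect: NoSchedule
    # Newer clusters use this key instead.
    - key: node-role.kubernetes.io/control-plane
      effect: NoSchedule
```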

### Why is the Kubernetes check failing with a ConnectTimeout error to port 10250?{% #why-is-the-kubernetes-check-failing-with-a-connecttimeout-error-to-port-10250 %}

The Agent assumes the kubelet API is available at the container's default gateway. If that's not the case because you are using a software-defined network such as Calico or Flannel, the kubelet host must be passed to the Agent explicitly through an environment variable:

```yaml
- name: KUBERNETES_KUBELET_HOST
  valueFrom:
    fieldRef:
      fieldPath: spec.nodeName
```
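For context, this environment variable belongs in the `env` list of the Agent container in your Deployment or DaemonSet manifest. A sketch of where it fits (the container name and image are illustrative):

```yaml
containers:
  - name: datadog-agent        # illustrative name
    image: datadog/agent:latest  # illustrative image tag
    env:
      # Resolve the kubelet host to the node the Agent pod runs on.
      - name: KUBERNETES_KUBELET_HOST
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
```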

For reference, see this [pull request](https://github.com/DataDog/dd-agent/pull/3051).

### Why is there a container in each Kubernetes pod with 0% CPU and minimal disk/ram?{% #why-is-there-a-container-in-each-kubernetes-pod-with-0-cpu-and-minimal-diskram %}

These are pause containers (`docker_image:gcr.io/google_containers/pause.*`) that K8s injects into every pod to keep it populated even if the "real" container is restarting or stopped.

The docker_daemon check ignores them through a default exclusion list, but they do show up for K8s metrics like `kubernetes.cpu.usage.total` and `kubernetes.filesystem.usage`.

## Further Reading{% #further-reading %}

- [Monitoring in the Kubernetes era](https://www.datadoghq.com/blog/monitoring-kubernetes-era)
