Kubernetes

Overview

Get metrics from the Kubernetes service in real time to:

  • Visualize and monitor Kubernetes states.
  • Be notified about Kubernetes failovers and events.

For Kubernetes, it’s recommended to run the Agent in a DaemonSet. We have created a Docker image with both the Docker and the Kubernetes integrations enabled.

You can also just run the Datadog Agent on your host and configure it to gather your Kubernetes metrics.

Setup Kubernetes

Installation

Container Installation

Thanks to Kubernetes, you can take advantage of DaemonSets to automatically deploy the Datadog Agent on all your nodes (or on specific nodes by using nodeSelectors).
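
As an illustration of the nodeSelector option, the fragment below sketches how a DaemonSet could be restricted to labeled nodes. The label key and value (monitoring: datadog) are hypothetical and not part of the original manifest; substitute a label that exists on your nodes.

```yaml
# Fragment of a DaemonSet manifest (not a complete manifest).
spec:
  template:
    spec:
      # Schedule the Agent only on nodes that carry this label.
      # "monitoring: datadog" is a hypothetical label used for illustration.
      nodeSelector:
        monitoring: datadog
```

Nodes without the label are skipped, so the DESIRED count reported for the DaemonSet would reflect only the matching nodes.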

If DaemonSets are not an option for your Kubernetes cluster, install the Datadog Agent as a sidecar container on each Kubernetes node.

If your Kubernetes has RBAC enabled, see the documentation on how to configure RBAC permissions with your Datadog-Kubernetes integration.

  • Create the following dd-agent.yaml manifest:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: dd-agent
spec:
  template:
    metadata:
      labels:
        app: dd-agent
      name: dd-agent
    spec:
      containers:
      - image: datadog/docker-dd-agent:latest
        imagePullPolicy: Always
        name: dd-agent
        ports:
          - containerPort: 8125
            name: dogstatsdport
            protocol: UDP
        env:
          - name: API_KEY
            value: "YOUR_API_KEY"
          - name: KUBERNETES
            value: "yes"
        volumeMounts:
          - name: dockersocket
            mountPath: /var/run/docker.sock
          - name: procdir
            mountPath: /host/proc
            readOnly: true
          - name: cgroups
            mountPath: /host/sys/fs/cgroup
            readOnly: true
      volumes:
        - hostPath:
            path: /var/run/docker.sock
          name: dockersocket
        - hostPath:
            path: /proc
          name: procdir
        - hostPath:
            path: /sys/fs/cgroup
          name: cgroups

Replace YOUR_API_KEY with your API key, or use Kubernetes secrets to set your API key as an environment variable.
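
A minimal sketch of the secrets approach, assuming a Secret named datadog-secret with a key named api-key (both names are hypothetical and chosen only for illustration). The second document shows how the API_KEY entry in the container's env list could reference that Secret instead of holding a literal value.

```yaml
# Hypothetical Secret holding the API key (names are illustrative).
apiVersion: v1
kind: Secret
metadata:
  name: datadog-secret
type: Opaque
data:
  # Base64-encoded API key, for example the output of:
  #   echo -n "YOUR_API_KEY" | base64
  api-key: "BASE64_ENCODED_API_KEY"
---
# Fragment: replacement for the API_KEY entry in the dd-agent container's env list.
env:
  - name: API_KEY
    valueFrom:
      secretKeyRef:
        name: datadog-secret
        key: api-key
```

The Secret has to exist in the same namespace as the DaemonSet before the Agent pods start; otherwise the containers cannot resolve the reference.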

  • Deploy the DaemonSet with the command: kubectl create -f dd-agent.yaml

Note: Autodiscovery’s auto-configuration feature is enabled when the manifest defines the SD_BACKEND environment variable. To disable it, remove that definition. To learn how to configure Autodiscovery, refer to its documentation.
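
The dd-agent.yaml manifest as reproduced above does not list SD_BACKEND among its env entries. The fragment below is a sketch of where such a definition would sit, assuming the value docker for the Docker backend. That value is an assumption here; check the Autodiscovery documentation for what your Agent version accepts.

```yaml
# Fragment of the dd-agent container's env list.
env:
  - name: API_KEY
    value: "YOUR_API_KEY"
  - name: KUBERNETES
    value: "yes"
  # Assumed entry: turns on Autodiscovery with the Docker backend.
  # Remove these two lines to disable auto-configuration.
  - name: SD_BACKEND
    value: docker
```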

Host Installation

Install the dd-check-kubernetes package manually or with your favorite configuration manager.

Configuration

Edit the kubernetes.yaml file to point to your server and port, and set the masters to monitor:

init_config:

instances:
  # Each instance is a list item; host and port point at the monitored endpoint.
  - host: localhost
    port: 4194
    method: http

See the example kubernetes.yaml for all available configuration options.

Validation

Container Running

To verify the Datadog Agent is running in your environment as a DaemonSet, execute:

kubectl get daemonset

If the Agent is deployed, you will see output similar to the text below, where DESIRED and CURRENT are equal to the number of nodes running in your cluster.

NAME       DESIRED   CURRENT   NODE-SELECTOR   AGE
dd-agent   3         3         <none>          11h

Agent check running

Run the Agent’s info subcommand and look for kubernetes under the Checks section:

Checks
======

    kubernetes
    -----------
      - instance #0 [OK]
      - Collected 39 metrics, 0 events & 7 service checks

Setup Kubernetes State

Installation

Container Installation

If you are running Kubernetes >= 1.2.0, you can use the kube-state-metrics project to provide additional metrics (identified by the kubernetes_state prefix in the metrics list below) to Datadog.

To run kube-state-metrics, create a kube-state-metrics.yaml file using the following manifest to deploy the kube-state-metrics service:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-state-metrics
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      containers:
      - name: kube-state-metrics
        image: gcr.io/google_containers/kube-state-metrics:v1.2.0
        ports:
        - name: metrics
          containerPort: 8080
        resources:
          requests:
            memory: 30Mi
            cpu: 100m
          limits:
            memory: 50Mi
            cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  labels:
    app: kube-state-metrics
  name: kube-state-metrics
spec:
  ports:
  - name: metrics
    port: 8080
    targetPort: metrics
    protocol: TCP
  selector:
    app: kube-state-metrics

Then deploy it by running:

kubectl create -f kube-state-metrics.yaml

The manifest above uses Google’s publicly available kube-state-metrics container, which is also available on Quay. If you want to build it manually, refer to the official project documentation.

If you configure your Kubernetes State Metrics service to run on a different URL or port, you can configure the Datadog Agent by setting the kube_state_url parameter in conf.d/kubernetes_state.yaml, then restarting the Agent. For more information, see the kubernetes_state.yaml.example file. If you have enabled Autodiscovery, the kube state URL will be configured and managed automatically.
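
A minimal sketch of such a conf.d/kubernetes_state.yaml, assuming the service is reachable at a hypothetical internal hostname on port 8080 and exposes its metrics under /metrics. Replace the URL with the address of your own kube-state-metrics service.

```yaml
# conf.d/kubernetes_state.yaml (sketch)
init_config:

instances:
  # Hypothetical service address; replace host and port with your own.
  - kube_state_url: http://kube-state-metrics.example.internal:8080/metrics
```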

Host Installation

Install the dd-check-kubernetes_state package manually or with your favorite configuration manager. (On CentOS/AWS, find your rpm package here, and information on installation on this page.) Then edit the kubernetes_state.yaml file to point to your server and port, and set the masters to monitor. See the example kubernetes_state.yaml for all available configuration options.

Validation

Container validation

To verify the Datadog Agent is running in your environment as a DaemonSet, execute:

kubectl get daemonset

If the Agent is deployed, you will see output similar to the text below, where DESIRED and CURRENT are equal to the number of nodes running in your cluster.

NAME       DESIRED   CURRENT   NODE-SELECTOR   AGE
dd-agent   3         3         <none>          11h

Agent check validation

Run the Agent’s info subcommand and look for kubernetes_state under the Checks section:

Checks
======

    kubernetes_state
    -----------
      - instance #0 [OK]
      - Collected 39 metrics, 0 events & 7 service checks

Setup Kubernetes DNS

Installation

Install the dd-check-kube_dns package manually or with your favorite configuration manager.

Configuration

Edit the kube_dns.yaml file to point to your server and port, and set the masters to monitor. See the sample kube_dns.yaml for all available configuration options.

Using with service discovery

If you are using one dd-agent pod per Kubernetes worker node, you can use the following annotations on your kube-dns pod to retrieve the data automatically.

apiVersion: v1
kind: Pod
metadata:
  annotations:
    service-discovery.datadoghq.com/kubedns.check_names: '["kube_dns"]'
    service-discovery.datadoghq.com/kubedns.init_configs: '[{}]'
    service-discovery.datadoghq.com/kubedns.instances: '[[{"prometheus_endpoint":"http://%%host%%:10055/metrics", "tags":["dns-pod:%%host%%"]}]]'

Remarks:

  • The “dns-pod” tag keeps track of the target DNS pod IP. The other tags relate to the dd-agent that is polling the information through service discovery.
  • The service discovery annotations need to be applied to the pod. In the case of a Deployment, add the annotations to the metadata of the pod template in the Deployment’s spec (spec.template.metadata).
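
For the Deployment case, the placement could look like the fragment below. It is a sketch that reuses the annotations shown earlier and omits every other field of the Deployment.

```yaml
# Fragment of a kube-dns Deployment manifest (not a complete manifest).
spec:
  template:
    metadata:
      # Annotations set here are copied onto every pod the Deployment creates.
      annotations:
        service-discovery.datadoghq.com/kubedns.check_names: '["kube_dns"]'
        service-discovery.datadoghq.com/kubedns.init_configs: '[{}]'
        service-discovery.datadoghq.com/kubedns.instances: '[[{"prometheus_endpoint":"http://%%host%%:10055/metrics", "tags":["dns-pod:%%host%%"]}]]'
```

Annotations placed on the Deployment's own top-level metadata are not propagated to its pods, which is why the pod template is the right location.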

Validation

Run the Agent’s info subcommand and look for kube_dns under the Checks section:

Checks
======

    kube_dns
    -----------
      - instance #0 [OK]
      - Collected 39 metrics, 0 events & 7 service checks

Data Collected

Metrics

Kubernetes

kubernetes.cpu.capacity
(gauge)
The number of cores in this machine.
shown as
kubernetes.cpu.usage.total
(gauge)
The percentage of CPU time used
shown as percent_nano
kubernetes.cpu.limits
(gauge)
The limit of cpu cores set
shown as cpu
kubernetes.cpu.requests
(gauge)
The requested cpu cores
shown as cpu
kubernetes.filesystem.usage
(gauge)
The amount of disk used
shown as byte
kubernetes.filesystem.usage_pct
(gauge)
The percentage of disk used
shown as fraction
kubernetes.memory.capacity
(gauge)
The amount of memory (in bytes) in this machine
shown as byte
kubernetes.memory.limits
(gauge)
The limit of memory set
shown as byte
kubernetes.memory.requests
(gauge)
The requested memory
shown as byte
kubernetes.memory.usage
(gauge)
The amount of memory used
shown as byte
kubernetes.network.rx_bytes
(gauge)
The amount of bytes per second received
shown as byte
kubernetes.network.tx_bytes
(gauge)
The amount of bytes per second transmitted
shown as byte
kubernetes.network_errors
(gauge)
The amount of network errors per second
shown as error

Kubernetes State

kubernetes_state.container.ready
(gauge)
Whether the container’s readiness check succeeded
shown as
kubernetes_state.container.running
(gauge)
Whether the container is currently in running state
shown as
kubernetes_state.container.terminated
(gauge)
Whether the container is currently in terminated state
shown as
kubernetes_state.container.status_report.count.terminated
(count)
Count of the containers currently reporting a terminated state, with the reason as a tag
shown as
kubernetes_state.container.waiting
(gauge)
Whether the container is currently in waiting state
shown as
kubernetes_state.container.status_report.count.waiting
(count)
Count of the containers currently reporting a waiting state, with the reason as a tag
shown as
kubernetes_state.container.gpu.request
(gauge)
The number of requested GPU devices by a container
shown as
kubernetes_state.container.gpu.limit
(gauge)
The limit on GPU devices to be used by a container
shown as
kubernetes_state.container.restarts
(gauge)
The number of restarts per container
shown as
kubernetes_state.container.cpu_requested
(gauge)
The number of requested cpu cores by a container
shown as cpu
kubernetes_state.container.memory_requested
(gauge)
The number of requested memory bytes by a container
shown as byte
kubernetes_state.container.cpu_limit
(gauge)
The limit on cpu cores to be used by a container
shown as cpu
kubernetes_state.container.memory_limit
(gauge)
The limit on memory to be used by a container
shown as byte
kubernetes_state.daemonset.scheduled
(gauge)
The number of nodes that are running at least one daemon pod and are supposed to
shown as
kubernetes_state.daemonset.misscheduled
(gauge)
The number of nodes that are running a daemon pod but are not supposed to
shown as
kubernetes_state.daemonset.desired
(gauge)
The number of nodes that should be running the daemon pod
shown as
kubernetes_state.daemonset.ready
(gauge)
The number of nodes that should be running the daemon pod and have one or more running and ready
shown as
kubernetes_state.deployment.replicas
(gauge)
The number of replicas per deployment
shown as
kubernetes_state.deployment.replicas_available
(gauge)
The number of available replicas per deployment
shown as
kubernetes_state.deployment.replicas_unavailable
(gauge)
The number of unavailable replicas per deployment
shown as
kubernetes_state.deployment.replicas_updated
(gauge)
The number of updated replicas per deployment
shown as
kubernetes_state.deployment.replicas_desired
(gauge)
The number of desired replicas per deployment
shown as
kubernetes_state.deployment.paused
(gauge)
Whether a deployment is paused
shown as
kubernetes_state.deployment.rollingupdate.max_unavailable
(gauge)
Maximum number of unavailable replicas during a rolling update
shown as
kubernetes_state.job.status.failed
(counter)
Observed number of failed pods in a job
shown as
kubernetes_state.job.status.succeeded
(counter)
Observed number of succeeded pods in a job
shown as
kubernetes_state.limitrange.cpu.min
(gauge)
Minimum CPU request for this type
shown as
kubernetes_state.limitrange.cpu.max
(gauge)
Maximum CPU limit for this type
shown as
kubernetes_state.limitrange.cpu.default
(gauge)
Default CPU limit if not specified
shown as
kubernetes_state.limitrange.cpu.default_request
(gauge)
Default CPU request if not specified
shown as
kubernetes_state.limitrange.cpu.max_limit_request_ratio
(gauge)
Maximum CPU limit / request ratio
shown as
kubernetes_state.limitrange.memory.min
(gauge)
Minimum memory request for this type
shown as
kubernetes_state.limitrange.memory.max
(gauge)
Maximum memory limit for this type
shown as
kubernetes_state.limitrange.memory.default
(gauge)
Default memory limit if not specified
shown as
kubernetes_state.limitrange.memory.default_request
(gauge)
Default memory request if not specified
shown as
kubernetes_state.limitrange.memory.max_limit_request_ratio
(gauge)
Maximum memory limit / request ratio
shown as
kubernetes_state.node.cpu_capacity
(gauge)
The total CPU resources of the node
shown as cpu
kubernetes_state.node.memory_capacity
(gauge)
The total memory resources of the node
shown as byte
kubernetes_state.node.pods_capacity
(gauge)
The total pod resources of the node
shown as
kubernetes_state.node.gpu.cards_allocatable
(gauge)
The GPU resources of a node that are available for scheduling
shown as
kubernetes_state.node.gpu.cards_capacity
(gauge)
The total GPU resources of the node
shown as
kubernetes_state.persistentvolumeclaim.status
(gauge)
The phase the persistent volume claim is currently in
shown as
kubernetes_state.node.cpu_allocatable
(gauge)
The CPU resources of a node that are available for scheduling
shown as cpu
kubernetes_state.node.memory_allocatable
(gauge)
The memory resources of a node that are available for scheduling
shown as byte
kubernetes_state.node.pods_allocatable
(gauge)
The pod resources of a node that are available for scheduling
shown as
kubernetes_state.node.status
(gauge)
Submitted with a value of 1 for each node and tagged either 'status:schedulable' or 'status:unschedulable'; Sum this metric by either status to get the number of nodes in that status.
shown as
kubernetes_state.hpa.min_replicas
(gauge)
Lower limit for the number of pods that can be set by the autoscaler
shown as
kubernetes_state.hpa.max_replicas
(gauge)
Upper limit for the number of pods that can be set by the autoscaler
shown as
kubernetes_state.hpa.target_cpu
(gauge)
Target CPU percentage of pods managed by this autoscaler
shown as
kubernetes_state.hpa.desired_replicas
(gauge)
Desired number of replicas of pods managed by this autoscaler
shown as
kubernetes_state.pod.ready
(gauge)
Whether the pod is ready to serve requests
shown as
kubernetes_state.pod.scheduled
(gauge)
Reports the status of the scheduling process for the pod with its tags
shown as
kubernetes_state.replicaset.replicas
(gauge)
The number of replicas per ReplicaSet
shown as
kubernetes_state.replicaset.fully_labeled_replicas
(gauge)
The number of fully labeled replicas per ReplicaSet
shown as
kubernetes_state.replicaset.replicas_ready
(gauge)
The number of ready replicas per ReplicaSet
shown as
kubernetes_state.replicaset.replicas_desired
(gauge)
Number of desired pods for a ReplicaSet
shown as
kubernetes_state.replicationcontroller.replicas
(gauge)
The number of replicas per ReplicationController
shown as
kubernetes_state.replicationcontroller.fully_labeled_replicas
(gauge)
The number of fully labeled replicas per ReplicationController
shown as
kubernetes_state.replicationcontroller.replicas_ready
(gauge)
The number of ready replicas per ReplicationController
shown as
kubernetes_state.replicationcontroller.replicas_desired
(gauge)
Number of desired replicas for a ReplicationController
shown as
kubernetes_state.replicationcontroller.replicas_available
(gauge)
The number of available replicas per ReplicationController
shown as
kubernetes_state.resourcequota.pods.used
(gauge)
Observed number of pods used for a resource quota
shown as
kubernetes_state.resourcequota.services.used
(gauge)
Observed number of services used for a resource quota
shown as
kubernetes_state.resourcequota.persistentvolumeclaims.used
(gauge)
Observed number of persistent volume claims used for a resource quota
shown as
kubernetes_state.resourcequota.services.nodeports.used
(gauge)
Observed number of node ports used for a resource quota
shown as
kubernetes_state.resourcequota.services.loadbalancers.used
(gauge)
Observed number of loadbalancers used for a resource quota
shown as
kubernetes_state.resourcequota.requests.cpu.used
(gauge)
Observed sum of CPU cores requested for a resource quota
shown as cpu
kubernetes_state.resourcequota.requests.memory.used
(gauge)
Observed sum of memory bytes requested for a resource quota
shown as byte
kubernetes_state.resourcequota.requests.storage.used
(gauge)
Observed sum of storage bytes requested for a resource quota
shown as byte
kubernetes_state.resourcequota.limits.cpu.used
(gauge)
Observed sum of limits for CPU cores for a resource quota
shown as cpu
kubernetes_state.resourcequota.limits.memory.used
(gauge)
Observed sum of limits for memory bytes for a resource quota
shown as byte
kubernetes_state.resourcequota.pods.limit
(gauge)
Hard limit of the number of pods for a resource quota
shown as
kubernetes_state.resourcequota.services.limit
(gauge)
Hard limit of the number of services for a resource quota
shown as
kubernetes_state.resourcequota.persistentvolumeclaims.limit
(gauge)
Hard limit of the number of persistent volume claims for a resource quota
shown as
kubernetes_state.resourcequota.services.nodeports.limit
(gauge)
Hard limit of the number of node ports for a resource quota
shown as
kubernetes_state.resourcequota.services.loadbalancers.limit
(gauge)
Hard limit of the number of loadbalancers for a resource quota
shown as
kubernetes_state.resourcequota.requests.cpu.limit
(gauge)
Hard limit on the total of CPU cores requested for a resource quota
shown as cpu
kubernetes_state.resourcequota.requests.memory.limit
(gauge)
Hard limit on the total of memory bytes requested for a resource quota
shown as byte
kubernetes_state.resourcequota.requests.storage.limit
(gauge)
Hard limit on the total of storage bytes requested for a resource quota
shown as byte
kubernetes_state.resourcequota.limits.cpu.limit
(gauge)
Hard limit on the sum of CPU core limits for a resource quota
shown as cpu
kubernetes_state.resourcequota.limits.memory.limit
(gauge)
Hard limit on the sum of memory bytes limits for a resource quota
shown as byte
kubernetes_state.statefulset.replicas
(gauge)
The number of replicas per StatefulSet
shown as
kubernetes_state.statefulset.replicas_desired
(gauge)
The number of desired replicas per StatefulSet
shown as
kubernetes_state.statefulset.replicas_current
(gauge)
The number of current replicas per StatefulSet
shown as
kubernetes_state.statefulset.replicas_ready
(gauge)
The number of ready replicas per StatefulSet
shown as
kubernetes_state.statefulset.replicas_updated
(gauge)
The number of updated replicas per StatefulSet
shown as

Kubernetes DNS

kubedns.response_size.bytes.sum
(gauge)
Size of the returned response in bytes.
shown as byte
kubedns.response_size.bytes.count
(gauge)
Number of responses on which the kubedns.response_size.bytes.sum metric is evaluated.
shown as response
kubedns.request_duration.seconds.sum
(gauge)
Time (in seconds) each request took to resolve.
shown as second
kubedns.request_duration.seconds.count
(gauge)
Number of requests on which the kubedns.request_duration.seconds.sum metric is evaluated.
shown as request
kubedns.request_count
(gauge)
Number of DNS requests made.
shown as request
kubedns.error_count
(gauge)
Number of DNS requests resulting in an error.
shown as error
kubedns.cachemiss_count
(gauge)
Number of DNS requests that result in a cache miss.
shown as request

Events

As of the 5.17.0 release, the Datadog Agent supports a built-in leader election option for the Kubernetes event collector. Once enabled, you no longer need to deploy an additional event collection container to your cluster. Instead, Agents coordinate to ensure that only one Agent instance is gathering events at a given time. The events below are available:

  • Backoff
  • Conflict
  • Delete
  • DeletingAllPods
  • Didn’t have enough resource
  • Error
  • Failed
  • FailedCreate
  • FailedDelete
  • FailedMount
  • FailedSync
  • Failedvalidation
  • FreeDiskSpaceFailed
  • HostPortConflict
  • InsufficientFreeCPU
  • InsufficientFreeMemory
  • InvalidDiskCapacity
  • Killing
  • KubeletsetupFailed
  • NodeNotReady
  • NodeoutofDisk
  • OutofDisk
  • Rebooted
  • TerminatedAllPods
  • Unable
  • Unhealthy

Service Checks

The Kubernetes check includes the following service checks:

  • kubernetes.kubelet.check: If CRITICAL, either kubernetes.kubelet.check.ping or kubernetes.kubelet.check.syncloop is in CRITICAL or NO DATA state.

  • kubernetes.kubelet.check.ping: If CRITICAL or NO DATA, Kubelet’s API isn’t available.

  • kubernetes.kubelet.check.syncloop: If CRITICAL or NO DATA, Kubelet’s sync loop that updates containers isn’t working.

Troubleshooting

Further Reading

To get a better idea of how (or why) to integrate your Kubernetes service, check out our series of blog posts about it.