Kubernetes Control Plane Monitoring

Overview

This section documents the specifics of monitoring the Kubernetes Control Plane and provides base configurations for doing so. You can then customize these configurations to add any Datadog feature.

With Datadog integrations for the API server, Etcd, Controller Manager, and Scheduler, you can collect key metrics from all four components of the Kubernetes Control Plane.

Kubernetes with Kubeadm

The following configurations are tested on Kubernetes v1.18+.

API server

The API server integration is configured automatically: the Datadog Agent discovers the API server without any additional configuration.

Etcd

Give the Datadog Agent read access to the Etcd certificates located on the host so that the Etcd check can communicate with Etcd and start collecting Etcd metrics.
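
Before deploying, it can help to confirm that the certificate files actually grant access to the Etcd metrics endpoint. The following is a minimal sketch of such a sanity check, not part of the Datadog setup itself. It assumes a kubeadm control plane node where Etcd listens on 127.0.0.1:2379, that curl is installed, and that the caller can read the key file. The function name and the CURL and ETCD_PKI_DIR variables are hypothetical.

```shell
# Hypothetical helper: query the Etcd metrics endpoint with the same
# certificate files that the Agent configuration references.
# CURL can be overridden (for example CURL=echo) to preview the command.
CURL="${CURL:-curl}"
# Host-side certificate directory (the Agent sees it under /host/...).
ETCD_PKI_DIR="${ETCD_PKI_DIR:-/etc/kubernetes/pki/etcd}"

probe_etcd_metrics() {
  # $1: Etcd host (assumed default: 127.0.0.1, as on a kubeadm node)
  etcd_host="${1:-127.0.0.1}"
  "$CURL" --silent --show-error \
    --cacert "$ETCD_PKI_DIR/ca.crt" \
    --cert "$ETCD_PKI_DIR/server.crt" \
    --key "$ETCD_PKI_DIR/server.key" \
    "https://${etcd_host}:2379/metrics"
}
```

Running probe_etcd_metrics | head on a control plane node should print Prometheus-format metric lines if the certificates are usable; a TLS or permission error suggests the paths or file permissions need attention.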

Custom values.yaml:

datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  clusterName: <CLUSTER_NAME>
  kubelet:
    tlsVerify: false
  ignoreAutoConfig:
  - etcd
  confd:
    etcd.yaml: |-
      ad_identifiers:
        - etcd
      instances:
        - prometheus_url: https://%%host%%:2379/metrics
          tls_ca_cert: /host/etc/kubernetes/pki/etcd/ca.crt
          tls_cert: /host/etc/kubernetes/pki/etcd/server.crt
          tls_private_key: /host/etc/kubernetes/pki/etcd/server.key
agents:
  volumes:
    - hostPath:
        path: /etc/kubernetes/pki/etcd
      name: etcd-certs
  volumeMounts:
    - name: etcd-certs
      mountPath: /host/etc/kubernetes/pki/etcd
      readOnly: true
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists

DatadogAgent Kubernetes Resource:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiKey: <DATADOG_API_KEY>
    appKey: <DATADOG_APP_KEY>
  clusterName: <CLUSTER_NAME>
  agent:
    image:
      name: "gcr.io/datadoghq/agent:latest"
    config:
      confd:
        configMapName: datadog-checks
      kubelet:
        tlsVerify: false
      volumes:
        - hostPath:
            path: /etc/kubernetes/pki/etcd
          name: etcd-certs
        - name: disable-etcd-autoconf
          emptyDir: {}
      volumeMounts:
        - name: etcd-certs
          mountPath: /host/etc/kubernetes/pki/etcd
          readOnly: true
        - name: disable-etcd-autoconf
          mountPath: /etc/datadog-agent/conf.d/etcd.d
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
  clusterAgent:
    image:
      name: "gcr.io/datadoghq/cluster-agent:latest"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: datadog-checks
data:
  etcd.yaml: |-
    ad_identifiers:
      - etcd
    init_config:
    instances:
      - prometheus_url: https://%%host%%:2379/metrics
        tls_ca_cert: /host/etc/kubernetes/pki/etcd/ca.crt
        tls_cert: /host/etc/kubernetes/pki/etcd/server.crt
        tls_private_key: /host/etc/kubernetes/pki/etcd/server.key
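
Either configuration above still has to be applied to the cluster. As a rough sketch, assume that the Helm values are saved as values.yaml, that the DatadogAgent resource and its ConfigMap are saved as datadog-agent.yaml, that the Helm release is named datadog, and (for the second function) that the Datadog Operator is already installed. The file names, release name, and function names are illustrative only.

```shell
# Hypothetical wrappers; set HELM=echo or KUBECTL=echo to preview the commands.
HELM="${HELM:-helm}"
KUBECTL="${KUBECTL:-kubectl}"

deploy_with_helm() {
  # $1: path to the custom values file (assumed name: values.yaml)
  "$HELM" repo add datadog https://helm.datadoghq.com &&
  "$HELM" repo update &&
  "$HELM" upgrade --install datadog datadog/datadog -f "${1:-values.yaml}"
}

deploy_with_operator() {
  # $1: manifest holding the DatadogAgent resource and its ConfigMap
  "$KUBECTL" apply -f "${1:-datadog-agent.yaml}"
}
```

Use only one of the two functions, depending on whether the Agent is managed with Helm or with the Operator.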

Controller Manager and Scheduler

Insecure ports

If the insecure ports of your Controller Manager and Scheduler instances are enabled, the Datadog Agent discovers the integrations and starts collecting metrics without any additional configuration.

Secure ports

Secure ports allow authentication and authorization to protect your Control Plane components. The Datadog Agent can collect Controller Manager and Scheduler metrics by targeting their secure ports.

Custom values.yaml:

datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  clusterName: <CLUSTER_NAME>
  kubelet:
    tlsVerify: false
  ignoreAutoConfig:
  - etcd
  - kube_scheduler
  - kube_controller_manager
  confd:
    etcd.yaml: |-
      ad_identifiers:
        - etcd
      instances:
        - prometheus_url: https://%%host%%:2379/metrics
          tls_ca_cert: /host/etc/kubernetes/pki/etcd/ca.crt
          tls_cert: /host/etc/kubernetes/pki/etcd/server.crt
          tls_private_key: /host/etc/kubernetes/pki/etcd/server.key
    kube_scheduler.yaml: |-
      ad_identifiers:
        - kube-scheduler
      instances:
        - prometheus_url: https://%%host%%:10259/metrics
          ssl_verify: false
          bearer_token_auth: true
    kube_controller_manager.yaml: |-
      ad_identifiers:
        - kube-controller-manager
      instances:
        - prometheus_url: https://%%host%%:10257/metrics
          ssl_verify: false
          bearer_token_auth: true
agents:
  volumes:
    - hostPath:
        path: /etc/kubernetes/pki/etcd
      name: etcd-certs
  volumeMounts:
    - name: etcd-certs
      mountPath: /host/etc/kubernetes/pki/etcd
      readOnly: true
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists

DatadogAgent Kubernetes Resource:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiKey: <DATADOG_API_KEY>
    appKey: <DATADOG_APP_KEY>
  clusterName: <CLUSTER_NAME>
  agent:
    image:
      name: "gcr.io/datadoghq/agent:latest"
    config:
      confd:
        configMapName: datadog-checks
      kubelet:
        tlsVerify: false
      volumes:
        - hostPath:
            path: /etc/kubernetes/pki/etcd
          name: etcd-certs
        - name: disable-etcd-autoconf
          emptyDir: {}
        - name: disable-scheduler-autoconf
          emptyDir: {}
        - name: disable-controller-manager-autoconf
          emptyDir: {}
      volumeMounts:
        - name: etcd-certs
          mountPath: /host/etc/kubernetes/pki/etcd
          readOnly: true
        - name: disable-etcd-autoconf
          mountPath: /etc/datadog-agent/conf.d/etcd.d
        - name: disable-scheduler-autoconf
          mountPath: /etc/datadog-agent/conf.d/kube_scheduler.d
        - name: disable-controller-manager-autoconf
          mountPath: /etc/datadog-agent/conf.d/kube_controller_manager.d
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
  clusterAgent:
    image:
      name: "gcr.io/datadoghq/cluster-agent:latest"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: datadog-checks
data:
  etcd.yaml: |-
    ad_identifiers:
      - etcd
    init_config:
    instances:
      - prometheus_url: https://%%host%%:2379/metrics
        tls_ca_cert: /host/etc/kubernetes/pki/etcd/ca.crt
        tls_cert: /host/etc/kubernetes/pki/etcd/server.crt
        tls_private_key: /host/etc/kubernetes/pki/etcd/server.key
  kube_scheduler.yaml: |-
    ad_identifiers:
      - kube-scheduler
    instances:
      - prometheus_url: https://%%host%%:10259/metrics
        ssl_verify: false
        bearer_token_auth: true
  kube_controller_manager.yaml: |-
    ad_identifiers:
      - kube-controller-manager
    instances:
      - prometheus_url: https://%%host%%:10257/metrics
        ssl_verify: false
        bearer_token_auth: true

Notes:

  • The ssl_verify field in the kube_controller_manager and kube_scheduler configuration needs to be set to false when using self-signed certificates.
  • When targeting secure ports, the bind-address option in your Controller Manager and Scheduler configuration must be reachable by the Datadog Agent. Example:
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    bind-address: 0.0.0.0
scheduler:
  extraArgs:
    bind-address: 0.0.0.0
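
To see what bearer_token_auth: true and ssl_verify: false amount to, and to check that the bind address is reachable, the request can be reproduced by hand. The sketch below assumes it runs inside a pod that has curl available and a mounted service account token that is allowed to read /metrics. The function name and the CURL and TOKEN_FILE variables are hypothetical.

```shell
# Hypothetical helper; set CURL=echo to preview the command.
CURL="${CURL:-curl}"
# Default location of the service account token inside a pod.
TOKEN_FILE="${TOKEN_FILE:-/var/run/secrets/kubernetes.io/serviceaccount/token}"

probe_secure_metrics() {
  # $1: node IP or host name
  # $2: secure port (10259 for the Scheduler, 10257 for the Controller Manager)
  token="$(cat "$TOKEN_FILE")" || return 1
  # --insecure mirrors ssl_verify: false;
  # the Authorization header mirrors bearer_token_auth: true.
  "$CURL" --silent --show-error --insecure \
    --header "Authorization: Bearer ${token}" \
    "https://$1:$2/metrics"
}
```

For example, probe_secure_metrics <node-ip> 10259 targets the Scheduler. A connection refused error would point at the bind address, while a 401 or 403 response would point at the token's permissions.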

Kubernetes on Amazon EKS

On Amazon Elastic Kubernetes Service (EKS), API server metrics are exposed. This allows the Datadog Agent to obtain API server metrics using endpoint checks as described in the Kubernetes API server metrics check documentation. To configure the check, add the following annotations to the default/kubernetes service:

annotations:
  ad.datadoghq.com/endpoints.check_names: '["kube_apiserver_metrics"]'
  ad.datadoghq.com/endpoints.init_configs: '[{}]'
  ad.datadoghq.com/endpoints.instances:
    '[{ "prometheus_url": "https://%%host%%:%%port%%/metrics", "bearer_token_auth": "true" }]'
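
The same annotations can also be applied from the command line rather than by editing the service manifest. A minimal sketch, assuming kubectl is configured for the EKS cluster; the function name and the KUBECTL override are hypothetical:

```shell
# Hypothetical helper; set KUBECTL=echo to preview the command.
KUBECTL="${KUBECTL:-kubectl}"

annotate_apiserver_service() {
  # --overwrite lets the command be re-run if the annotations already exist.
  "$KUBECTL" annotate service kubernetes -n default --overwrite \
    'ad.datadoghq.com/endpoints.check_names=["kube_apiserver_metrics"]' \
    'ad.datadoghq.com/endpoints.init_configs=[{}]' \
    'ad.datadoghq.com/endpoints.instances=[{"prometheus_url": "https://%%host%%:%%port%%/metrics", "bearer_token_auth": "true"}]'
}
```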

Other control plane components are not exposed in EKS and cannot be monitored.

Kubernetes on OpenShift 4

On OpenShift 4, all control plane components can be monitored using endpoint checks.

Prerequisites

  1. Enable the Datadog Cluster Agent
  2. Enable Cluster checks
  3. Enable Endpoint checks
  4. Ensure that you are logged in with sufficient permissions to edit services and create secrets.

API server

The API server runs behind the service kubernetes in the default namespace. Annotate this service with the kube_apiserver_metrics configuration:

oc annotate service kubernetes -n default 'ad.datadoghq.com/endpoints.check_names=["kube_apiserver_metrics"]'
oc annotate service kubernetes -n default 'ad.datadoghq.com/endpoints.init_configs=[{}]'
oc annotate service kubernetes -n default 'ad.datadoghq.com/endpoints.instances=[{"prometheus_url": "https://%%host%%:%%port%%/metrics", "bearer_token_auth": "true"}]'
oc annotate service kubernetes -n default 'ad.datadoghq.com/endpoints.resolve=ip'

The last annotation ad.datadoghq.com/endpoints.resolve is needed because the service is in front of static pods. The Datadog Cluster Agent schedules the checks as endpoint checks and dispatches them to Cluster Check Runners. The nodes they are running on can be identified with:

oc exec -it <datadog cluster agent pod> -n <datadog ns> -- agent clusterchecks

Etcd

Communicating with the Etcd service requires certificates, which are stored in the secret kube-etcd-client-certs in the openshift-monitoring namespace. To give the Datadog Agent access to these certificates, first copy them into the namespace in which the Datadog Agent is running:

oc get secret kube-etcd-client-certs -n openshift-monitoring -o yaml | sed 's/namespace: openshift-monitoring/namespace: <datadog agent namespace>/'  | oc create -f -
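
When a secret is mounted as a volume, each key in the secret becomes a file under the mount path, so the key names need to match the file names used in the check configuration. One way to list them after the copy, sketched with a hypothetical helper (the function name and the OC override are assumptions):

```shell
# Hypothetical helper; set OC=echo to preview the command.
OC="${OC:-oc}"

list_etcd_cert_secret() {
  # $1: namespace the Datadog Agent runs in
  # "describe" prints the key names and sizes without revealing the contents.
  "$OC" describe secret kube-etcd-client-certs -n "$1"
}
```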

Mount these certificates on the Cluster Check Runner pods by adding the volumes and volumeMounts below.

Note: Mounts are also included to disable the Etcd check autoconfiguration file packaged with the Agent.

Custom values.yaml:

...
clusterChecksRunner:
  volumes:
    - name: etcd-certs
      secret:
        secretName: kube-etcd-client-certs
    - name: disable-etcd-autoconf
      emptyDir: {}
  volumeMounts:
    - name: etcd-certs
      mountPath: /etc/etcd-certs
      readOnly: true
    - name: disable-etcd-autoconf
      mountPath: /etc/datadog-agent/conf.d/etcd.d

DatadogAgent Kubernetes Resource:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
...
spec:
  agent:
    clusterChecksRunner:
      config:
        volumes:
        - name: etcd-certs
          secret:
            secretName: kube-etcd-client-certs
        - name: disable-etcd-autoconf
          emptyDir: {}
        volumeMounts:
        - name: etcd-certs
          mountPath: /etc/etcd-certs
          readOnly: true
        - name: disable-etcd-autoconf
          mountPath: /etc/datadog-agent/conf.d/etcd.d

Then, annotate the service running in front of Etcd:

oc annotate service etcd -n openshift-etcd 'ad.datadoghq.com/endpoints.check_names=["etcd"]'
oc annotate service etcd -n openshift-etcd 'ad.datadoghq.com/endpoints.init_configs=[{}]'
oc annotate service etcd -n openshift-etcd 'ad.datadoghq.com/endpoints.instances=[{"prometheus_url": "https://%%host%%:%%port%%/metrics", "tls_ca_cert": "/etc/etcd-certs/etcd-client-ca.crt", "tls_cert": "/etc/etcd-certs/etcd-client.crt",
      "tls_private_key": "/etc/etcd-certs/etcd-client.key"}]'
oc annotate service etcd -n openshift-etcd 'ad.datadoghq.com/endpoints.resolve=ip'

The Datadog Cluster Agent schedules the checks as endpoint checks and dispatches them to Cluster Check Runners.

Controller Manager

The Controller Manager runs behind the service kube-controller-manager in the openshift-kube-controller-manager namespace. Annotate the service with the check configuration:

oc annotate service kube-controller-manager -n openshift-kube-controller-manager 'ad.datadoghq.com/endpoints.check_names=["kube_controller_manager"]'
oc annotate service kube-controller-manager -n openshift-kube-controller-manager 'ad.datadoghq.com/endpoints.init_configs=[{}]'
oc annotate service kube-controller-manager -n openshift-kube-controller-manager 'ad.datadoghq.com/endpoints.instances=[{"prometheus_url": "https://%%host%%:%%port%%/metrics", "ssl_verify": "false", "bearer_token_auth": "true"}]'
oc annotate service kube-controller-manager -n openshift-kube-controller-manager 'ad.datadoghq.com/endpoints.resolve=ip'

The Datadog Cluster Agent schedules the checks as endpoint checks and dispatches them to Cluster Check Runners.

Scheduler

The Scheduler runs behind the service scheduler in the openshift-kube-scheduler namespace. Annotate the service with the check configuration:

oc annotate service scheduler -n openshift-kube-scheduler 'ad.datadoghq.com/endpoints.check_names=["kube_scheduler"]'
oc annotate service scheduler -n openshift-kube-scheduler 'ad.datadoghq.com/endpoints.init_configs=[{}]'
oc annotate service scheduler -n openshift-kube-scheduler 'ad.datadoghq.com/endpoints.instances=[{"prometheus_url": "https://%%host%%:%%port%%/metrics", "ssl_verify": "false", "bearer_token_auth": "true"}]'
oc annotate service scheduler -n openshift-kube-scheduler 'ad.datadoghq.com/endpoints.resolve=ip'

The Datadog Cluster Agent schedules the checks as endpoint checks and dispatches them to Cluster Check Runners.
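
With all four services annotated, it can be useful to review the annotations in one pass before checking the dispatch with agent clusterchecks. The sketch below uses the service and namespace names from the sections above; the function name and the OC override are hypothetical.

```shell
# Hypothetical helper; set OC=echo to preview the commands.
OC="${OC:-oc}"

show_control_plane_annotations() {
  # service/namespace pairs as documented for OpenShift 4
  for pair in \
    kubernetes/default \
    etcd/openshift-etcd \
    kube-controller-manager/openshift-kube-controller-manager \
    scheduler/openshift-kube-scheduler
  do
    svc="${pair%%/*}"   # text before the first slash
    ns="${pair#*/}"     # text after the first slash
    echo "== ${svc} (${ns})"
    "$OC" get service "$svc" -n "$ns" -o jsonpath='{.metadata.annotations}'
    echo
  done
}
```

Each service should show the four ad.datadoghq.com/endpoints.* annotations; a missing one usually means the corresponding oc annotate command was skipped or failed.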

Kubernetes on OpenShift 3

On OpenShift 3, all control plane components can be monitored using endpoint checks.

Prerequisites

  1. Enable the Datadog Cluster Agent
  2. Enable Cluster checks
  3. Enable Endpoint checks
  4. Ensure that you are logged in with sufficient permissions to create and edit services.

API server

The API server runs behind the service kubernetes in the default namespace. Annotate this service with the kube_apiserver_metrics configuration:

oc annotate service kubernetes -n default 'ad.datadoghq.com/endpoints.check_names=["kube_apiserver_metrics"]'
oc annotate service kubernetes -n default 'ad.datadoghq.com/endpoints.init_configs=[{}]'
oc annotate service kubernetes -n default 'ad.datadoghq.com/endpoints.instances=[{"prometheus_url": "https://%%host%%:%%port%%/metrics", "bearer_token_auth": "true"}]'
oc annotate service kubernetes -n default 'ad.datadoghq.com/endpoints.resolve=ip'

The last annotation ad.datadoghq.com/endpoints.resolve is needed because the service is in front of static pods. The Datadog Cluster Agent schedules the checks as endpoint checks and dispatches them to Cluster Check Runners. The nodes they are running on can be identified with:

oc exec -it <datadog cluster agent pod> -n <datadog ns> -- agent clusterchecks

Etcd

Communicating with the Etcd service requires certificates, which are located on the host. Mount these certificates on the Cluster Check Runner pods by adding the volumes and volumeMounts below.

Note: Mounts are also included to disable the Etcd check autoconfiguration file packaged with the Agent.

Custom values.yaml:

...
clusterChecksRunner:
  volumes:
    - hostPath:
        path: /etc/etcd
      name: etcd-certs
    - name: disable-etcd-autoconf
      emptyDir: {}
  volumeMounts:
    - name: etcd-certs
      mountPath: /host/etc/etcd
      readOnly: true
    - name: disable-etcd-autoconf
      mountPath: /etc/datadog-agent/conf.d/etcd.d

DatadogAgent Kubernetes Resource:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
...
spec:
  agent:
    clusterChecksRunner:
      config:
        volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/etcd
          - name: disable-etcd-autoconf
            emptyDir: {}
        volumeMounts:
          - name: etcd-certs
            mountPath: /host/etc/etcd
            readOnly: true
          - name: disable-etcd-autoconf
            mountPath: /etc/datadog-agent/conf.d/etcd.d

Direct edits of this service are not persisted, so make a copy of the Etcd service:

oc get service etcd -n kube-system -o yaml | sed 's/name: etcd/name: etcd-copy/'  | oc create -f -

Annotate the copied service with the check configuration:

oc annotate service etcd-copy -n kube-system 'ad.datadoghq.com/endpoints.check_names=["etcd"]'
oc annotate service etcd-copy -n kube-system 'ad.datadoghq.com/endpoints.init_configs=[{}]'
oc annotate service etcd-copy -n kube-system 'ad.datadoghq.com/endpoints.instances=[{"prometheus_url": "https://%%host%%:%%port%%/metrics", "tls_ca_cert": "/host/etc/etcd/ca/ca.crt", "tls_cert": "/host/etc/etcd/server.crt",
      "tls_private_key": "/host/etc/etcd/server.key"}]'
oc annotate service etcd-copy -n kube-system 'ad.datadoghq.com/endpoints.resolve=ip'

The Datadog Cluster Agent schedules the checks as endpoint checks and dispatches them to Cluster Check Runners.

Controller Manager and Scheduler

The Controller Manager and Scheduler run behind the same service, kube-controllers in the kube-system namespace. Direct edits of the service are not persisted, so make a copy of the service:

oc get service kube-controllers -n kube-system -o yaml | sed 's/name: kube-controllers/name: kube-controllers-copy/'  | oc create -f -

Annotate the copied service with the check configurations:

oc annotate service kube-controllers-copy -n kube-system 'ad.datadoghq.com/endpoints.check_names=["kube_controller_manager", "kube_scheduler"]'
oc annotate service kube-controllers-copy -n kube-system 'ad.datadoghq.com/endpoints.init_configs=[{}, {}]'
oc annotate service kube-controllers-copy -n kube-system 'ad.datadoghq.com/endpoints.instances=[{ "prometheus_url": "https://%%host%%:%%port%%/metrics",
      "ssl_verify": "false", "bearer_token_auth": "true" }, { "prometheus_url": "https://%%host%%:%%port%%/metrics",
      "ssl_verify": "false", "bearer_token_auth": "true" }]'
oc annotate service kube-controllers-copy -n kube-system 'ad.datadoghq.com/endpoints.resolve=ip'

The Datadog Cluster Agent schedules the checks as endpoint checks and dispatches them to Cluster Check Runners.

Kubernetes on managed services (AKS, GKE)

On other managed services, such as Azure Kubernetes Service (AKS) and Google Kubernetes Engine (GKE), users cannot access the control plane components. As a result, it is not possible to run the kube_apiserver_metrics, kube_controller_manager, kube_scheduler, and etcd checks in these environments.

Kubernetes on Rancher Kubernetes Engine (v2.5+)

Rancher v2.5 relies on PushProx to expose control plane metric endpoints, which allows the Datadog Agent to run control plane checks and collect metrics.

Prerequisites

  1. Install the rancher-monitoring chart.
  2. Confirm that the pushprox daemonsets are deployed with rancher-monitoring and running in the cattle-monitoring-system namespace.
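
The second prerequisite can be checked from the command line. A minimal sketch, assuming kubectl is configured for the Rancher-managed cluster; the function name and the KUBECTL override are hypothetical:

```shell
# Hypothetical helper; set KUBECTL=echo to preview the command.
KUBECTL="${KUBECTL:-kubectl}"

list_pushprox_daemonsets() {
  # Lists the daemonsets installed by rancher-monitoring, pushprox ones included.
  "$KUBECTL" get daemonsets -n cattle-monitoring-system
}
```

The pushprox daemonsets should appear in the output with their desired and ready counts matching.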

API server

To configure the kube_apiserver_metrics check, add the following annotations to the default/kubernetes service:

annotations:
  ad.datadoghq.com/endpoints.check_names: '["kube_apiserver_metrics"]'
  ad.datadoghq.com/endpoints.init_configs: '[{}]'
  ad.datadoghq.com/endpoints.instances: '[{ "prometheus_url": "https://%%host%%:%%port%%/metrics", "bearer_token_auth": "true" }]'

Add Kubernetes services to configure auto-discovery checks

Adding headless Kubernetes services that define check configurations lets the Datadog Agent target the pushprox pods and collect metrics.

Apply rancher-control-plane-services.yaml:

apiVersion: v1
kind: Service
metadata:
  name: pushprox-kube-scheduler-datadog
  namespace: cattle-monitoring-system
  labels:
    component: kube-scheduler
    k8s-app: pushprox-kube-scheduler-client
  annotations:
    ad.datadoghq.com/endpoints.check_names: '["kube_scheduler"]'
    ad.datadoghq.com/endpoints.init_configs: '[{}]'
    ad.datadoghq.com/endpoints.instances: |
      [
        {
          "prometheus_url": "http://%%host%%:10251/metrics"
        }
      ]
spec:
  clusterIP: None
  selector:
    k8s-app: pushprox-kube-scheduler-client
---
apiVersion: v1
kind: Service
metadata:
  name: pushprox-kube-controller-manager-datadog
  namespace: cattle-monitoring-system
  labels:
    component: kube-controller-manager
    k8s-app: pushprox-kube-controller-manager-client
  annotations:
    ad.datadoghq.com/endpoints.check_names: '["kube_controller_manager"]'
    ad.datadoghq.com/endpoints.init_configs: '[{}]'
    ad.datadoghq.com/endpoints.instances: |
      [
        {
          "prometheus_url": "http://%%host%%:10252/metrics"
        }
      ]
spec:
  clusterIP: None
  selector:
    k8s-app: pushprox-kube-controller-manager-client
---
apiVersion: v1
kind: Service
metadata:
  name: pushprox-kube-etcd-datadog
  namespace: cattle-monitoring-system
  labels:
    component: kube-etcd
    k8s-app: pushprox-kube-etcd-client
  annotations:
    ad.datadoghq.com/endpoints.check_names: '["etcd"]'
    ad.datadoghq.com/endpoints.init_configs: '[{}]'
    ad.datadoghq.com/endpoints.instances: |
      [
        {
          "prometheus_url": "https://%%host%%:2379/metrics",
          "tls_ca_cert": "/host/opt/rke/etc/kubernetes/ssl/kube-ca.pem",
          "tls_cert": "/host/opt/rke/etc/kubernetes/ssl/kube-etcd-<node-ip>.pem",
          "tls_private_key": "/host/opt/rke/etc/kubernetes/ssl/kube-etcd-<node-ip>-key.pem"
        }
      ]
spec:
  clusterIP: None
  selector:
    k8s-app: pushprox-kube-etcd-client
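
Because each of these services has a selector, Kubernetes creates a matching Endpoints object that lists the IPs of the selected pushprox pods, and those are the addresses that the endpoint checks resolve. The sketch below applies the manifest and then lists the three Endpoints objects. It assumes the manifest is saved under the file name used above; the function name and the KUBECTL override are hypothetical.

```shell
# Hypothetical helper; set KUBECTL=echo to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

apply_rancher_services() {
  # $1: manifest path (assumed default: rancher-control-plane-services.yaml)
  "$KUBECTL" apply -f "${1:-rancher-control-plane-services.yaml}" &&
  "$KUBECTL" get endpoints -n cattle-monitoring-system \
    pushprox-kube-scheduler-datadog \
    pushprox-kube-controller-manager-datadog \
    pushprox-kube-etcd-datadog
}
```

An empty ENDPOINTS column for one of the services suggests that its selector does not match any running pushprox pod.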

Deploy the Datadog Agent with manifests based on the following configurations:

Custom values.yaml:

datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  clusterName: <CLUSTER_NAME>
  kubelet:
    tlsVerify: false
agents:
  volumes:
    - hostPath:
        path: /opt/rke/etc/kubernetes/ssl
      name: etcd-certs
  volumeMounts:
    - name: etcd-certs
      mountPath: /host/opt/rke/etc/kubernetes/ssl
      readOnly: true
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/controlplane
    operator: Exists
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd
    operator: Exists

DatadogAgent Kubernetes Resource:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiKey: <DATADOG_API_KEY>
    appKey: <DATADOG_APP_KEY>
  clusterName: <CLUSTER_NAME>
  agent:
    config:
      kubelet:
        tlsVerify: false
      volumes:
        - hostPath:
            path: /opt/rke/etc/kubernetes/ssl
          name: etcd-certs
      volumeMounts:
        - name: etcd-certs
          mountPath: /host/opt/rke/etc/kubernetes/ssl
          readOnly: true
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/controlplane
          operator: Exists
        - effect: NoExecute
          key: node-role.kubernetes.io/etcd
          operator: Exists
  clusterAgent:
    config:
      clusterChecksEnabled: true