Kubernetes distributions
Overview
This section documents distribution-specific requirements and provides a good base configuration for all major Kubernetes distributions.
These configurations can then be customized to add any Datadog feature.
AWS Elastic Kubernetes Service (EKS)
No specific configuration is required.
If you are using AWS Bottlerocket OS on your nodes, add the following to enable container monitoring (`containerd` check):
Custom `values.yaml`:

```yaml
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  criSocketPath: /run/dockershim.sock
  env:
    - name: DD_AUTOCONFIG_INCLUDE_FEATURES
      value: "containerd"
```
DatadogAgent Kubernetes Resource:

```yaml
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiKey: <DATADOG_API_KEY>
    appKey: <DATADOG_APP_KEY>
  agent:
    config:
      criSocket:
        criSocketPath: /run/dockershim.sock
  clusterAgent:
    image:
      name: "gcr.io/datadoghq/cluster-agent:latest"
    config:
      externalMetrics:
        enabled: false
      admissionController:
        enabled: false
```
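Either option is deployed with the standard Helm or Operator workflow. A minimal sketch, assuming Helm 3, the official `datadog/datadog` chart, and a kubeconfig pointing at the cluster (the local file names are illustrative):

```shell
# Helm path: install the chart with the custom values file shown above.
helm repo add datadog https://helm.datadoghq.com
helm repo update
helm install datadog-agent -f values.yaml datadog/datadog

# Operator path: apply the DatadogAgent resource
# (requires the Datadog Operator to already be running in the cluster).
kubectl apply -f datadog-agent.yaml
```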
Azure Kubernetes Service (AKS)
AKS requires specific configuration for the Kubelet integration due to the AKS certificates setup.

Custom `values.yaml`:

```yaml
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  kubelet:
    tlsVerify: false # Required as of Agent 7.35. See Notes.
```
DatadogAgent Kubernetes Resource:

```yaml
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiKey: <DATADOG_API_KEY>
    appKey: <DATADOG_APP_KEY>
  agent:
    config:
      kubelet:
        tlsVerify: false # Required as of Agent 7.35. See Notes.
  clusterAgent:
    image:
      name: "gcr.io/datadoghq/cluster-agent:latest"
    config:
      externalMetrics:
        enabled: false
      admissionController:
        enabled: false
```
Notes:

- As of Agent 7.35, `tlsVerify: false` is required because Kubelet certificates in AKS do not have a Subject Alternative Name (SAN) set.
- In some setups, DNS resolution for `spec.nodeName` inside Pods does not work in AKS. This has been reported on all AKS Windows nodes, and on Linux nodes when the cluster is set up in a Virtual Network using custom DNS. In this case, remove the `agent.config.kubelet.host` field (which defaults to `status.hostIP`) and use `tlsVerify: false`. Setting the `DD_KUBELET_TLS_VERIFY=false` environment variable also resolves this issue. Both of these options deactivate verification of the server certificate.

```yaml
env:
  - name: DD_KUBELET_TLS_VERIFY
    value: "false"
```
The Admission Controller functionality on AKS requires adding selectors to prevent an error when reconciling the webhook:

```yaml
clusterAgent:
  env:
    - name: "DD_ADMISSION_CONTROLLER_ADD_AKS_SELECTORS"
      value: "true"
```
Google Kubernetes Engine (GKE)
GKE can be configured in two different modes of operation:
- Standard: You manage the cluster’s underlying infrastructure, giving you node configuration flexibility.
- Autopilot: GKE provisions and manages the cluster’s underlying infrastructure, including nodes and node pools, giving you an optimized cluster with a hands-off experience.
Depending on the operation mode of your cluster, the Datadog Agent needs to be configured differently.
Standard
Since Agent 7.26, no specific configuration is required for GKE (whether you run Docker or containerd).

Note: When using COS (Container Optimized OS), the eBPF-based OOM Kill and TCP Queue Length checks are supported starting from version 3.0.1 of the Helm chart. To enable these checks, set `datadog.systemProbe.enableDefaultKernelHeadersPaths` to `false`.
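For example, the two eBPF-based checks can be enabled together with that setting in the Helm values. The key names below follow the chart's `datadog.systemProbe` section; verify them against your chart version:

```yaml
datadog:
  systemProbe:
    # On COS, use the node-provided kernel headers instead of the default paths.
    enableDefaultKernelHeadersPaths: false
    # Enable the eBPF-based checks mentioned above.
    enableOOMKill: true
    enableTCPQueueLength: true
```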
Autopilot
GKE Autopilot requires some configuration, shown below.
Datadog recommends that you specify resource limits for the Agent container. Autopilot sets a relatively low default limit (50m CPU, 100Mi memory) that may quickly cause the Agent container to be OOMKilled, depending on your environment. If applicable, also specify resource limits for the Trace Agent and Process Agent containers.
Custom `values.yaml`:

```yaml
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  clusterName: <CLUSTER_NAME>
  # Enable the new `kubernetes_state_core` check.
  kubeStateMetricsCore:
    enabled: true
  # Avoid deploying the kube-state-metrics chart.
  # The new `kubernetes_state_core` check does not require kube-state-metrics anymore.
  kubeStateMetricsEnabled: false
agents:
  containers:
    agent:
      # Resources for the Agent container
      resources:
        requests:
          cpu: 200m
          memory: 256Mi
        limits:
          cpu: 200m
          memory: 256Mi
    traceAgent:
      # Resources for the Trace Agent container
      resources:
        requests:
          cpu: 100m
          memory: 200Mi
        limits:
          cpu: 100m
          memory: 200Mi
    processAgent:
      # Resources for the Process Agent container
      resources:
        requests:
          cpu: 100m
          memory: 200Mi
        limits:
          cpu: 100m
          memory: 200Mi
providers:
  gke:
    autopilot: true
```
Red Hat OpenShift
OpenShift comes with hardened security by default (SELinux, SecurityContextConstraints), which requires some specific configuration:

- Create an SCC for the Node Agent and the Cluster Agent
- Use a specific CRI socket path, as OpenShift uses the CRI-O container runtime
- Kubelet API certificates may not always be signed by the cluster CA
- Tolerations are required to schedule the Node Agent on `master` and `infra` nodes
- The cluster name should be set, as it cannot be retrieved automatically from the cloud provider
This configuration supports OpenShift 3.11 and OpenShift 4, but works best with OpenShift 4.
Custom `values.yaml`:

```yaml
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  clusterName: <CLUSTER_NAME>
  criSocketPath: /var/run/crio/crio.sock
  # Depending on your DNS/SSL setup, it might not be possible to verify the Kubelet certificate properly.
  # If you have a proper CA, you can switch it to true.
  kubelet:
    tlsVerify: false
agents:
  podSecurity:
    securityContextConstraints:
      create: true
  tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
      operator: Exists
    - effect: NoSchedule
      key: node-role.kubernetes.io/infra
      operator: Exists
clusterAgent:
  podSecurity:
    securityContextConstraints:
      create: true
kube-state-metrics:
  securityContext:
    enabled: false
```
When using the Datadog Operator in OpenShift, it is recommended that you install it through OperatorHub or Red Hat Marketplace. The configuration below is meant to work with this setup (due to the SCC/ServiceAccount setup), when the Agent is installed in the same namespace as the Datadog Operator.
```yaml
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiKey: <DATADOG_API_KEY>
    appKey: <DATADOG_APP_KEY>
  clusterName: <CLUSTER_NAME>
  agent:
    image:
      name: "gcr.io/datadoghq/agent:latest"
    apm:
      enabled: false
    process:
      enabled: true
      processCollectionEnabled: false
    log:
      enabled: false
    systemProbe:
      enabled: false
    security:
      compliance:
        enabled: false
      runtime:
        enabled: false
    rbac:
      serviceAccountName: datadog-agent-scc
    config:
      kubelet:
        tlsVerify: false
      criSocket:
        criSocketPath: /var/run/crio/crio.sock
        useCriSocketVolume: true
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
          operator: Exists
        - effect: NoSchedule
          key: node-role.kubernetes.io/infra
          operator: Exists
  clusterAgent:
    image:
      name: "gcr.io/datadoghq/cluster-agent:latest"
    config:
      externalMetrics:
        enabled: false
        port: 8443
      admissionController:
        enabled: false
```
Rancher
Rancher installations are close to vanilla Kubernetes, requiring only some minor configuration:
- Tolerations are required to schedule the Node Agent on `controlplane` and `etcd` nodes
- The cluster name should be set, as it cannot be retrieved automatically from the cloud provider
Custom `values.yaml`:

```yaml
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  clusterName: <CLUSTER_NAME>
  kubelet:
    tlsVerify: false
agents:
  tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/controlplane
      operator: Exists
    - effect: NoExecute
      key: node-role.kubernetes.io/etcd
      operator: Exists
```
DatadogAgent Kubernetes Resource:

```yaml
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiKey: <DATADOG_API_KEY>
    appKey: <DATADOG_APP_KEY>
  clusterName: <CLUSTER_NAME>
  agent:
    image:
      name: "gcr.io/datadoghq/agent:latest"
    apm:
      enabled: false
    process:
      enabled: true
      processCollectionEnabled: false
    log:
      enabled: false
    systemProbe:
      enabled: false
    security:
      compliance:
        enabled: false
      runtime:
        enabled: false
    config:
      kubelet:
        tlsVerify: false
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/controlplane
          operator: Exists
        - effect: NoExecute
          key: node-role.kubernetes.io/etcd
          operator: Exists
  clusterAgent:
    image:
      name: "gcr.io/datadoghq/cluster-agent:latest"
    config:
      externalMetrics:
        enabled: false
      admissionController:
        enabled: false
```
Oracle Container Engine for Kubernetes (OKE)
No specific configuration is required.
To enable container monitoring (`containerd` check), add the following:
Custom `values.yaml`:

```yaml
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  criSocketPath: /run/dockershim.sock
  env:
    - name: DD_AUTOCONFIG_INCLUDE_FEATURES
      value: "containerd"
```
DatadogAgent Kubernetes Resource:

```yaml
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiKey: <DATADOG_API_KEY>
    appKey: <DATADOG_APP_KEY>
  agent:
    config:
      criSocket:
        criSocketPath: /run/dockershim.sock
  clusterAgent:
    image:
      name: "gcr.io/datadoghq/cluster-agent:latest"
    config:
      externalMetrics:
        enabled: false
      admissionController:
        enabled: false
```
More `values.yaml` examples can be found in the Helm chart repository.

More `DatadogAgent` examples can be found in the Datadog Operator repository.
vSphere Tanzu Kubernetes Grid (TKG)
TKG requires some small configuration changes, shown below. For example, setting a toleration is required for the controller to schedule the Node Agent on the `master` nodes.
Custom `values.yaml`:

```yaml
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  kubelet:
    # Set tlsVerify to false since the Kubelet certificates are self-signed
    tlsVerify: false
  # Disable the `kube-state-metrics` dependency chart installation.
  kubeStateMetricsEnabled: false
  # Enable the new `kubernetes_state_core` check.
  kubeStateMetricsCore:
    enabled: true
# Add a toleration so that the Agent can be scheduled on the control plane nodes.
agents:
  tolerations:
    - key: node-role.kubernetes.io/master
      effect: NoSchedule
```
DatadogAgent Kubernetes Resource:

```yaml
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiSecret:
      secretName: datadog-secret
      keyName: api-key
    appSecret:
      secretName: datadog-secret
      keyName: app-key
  features:
    # Enable the new `kubernetes_state_core` check.
    kubeStateMetricsCore:
      enabled: true
  agent:
    config:
      kubelet:
        # Set tlsVerify to false since the Kubelet certificates are self-signed
        tlsVerify: false
      # Add a toleration so that the Agent can be scheduled on the control plane nodes.
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
  clusterAgent:
    config:
      collectEvents: true
```
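The `apiSecret` and `appSecret` references above assume a pre-existing Secret named `datadog-secret` containing `api-key` and `app-key` entries. A minimal sketch of such a Secret, to create in the Agent's namespace before applying the DatadogAgent resource:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: datadog-secret
type: Opaque
stringData:
  # Plain-text values; Kubernetes base64-encodes stringData entries on admission.
  api-key: <DATADOG_API_KEY>
  app-key: <DATADOG_APP_KEY>
```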