Autoscaling with Cluster Agent Custom & External Metrics

Overview

Horizontal Pod Autoscaling (HPA), introduced in Kubernetes v1.2, allows autoscaling based on basic metrics like CPU, but it requires the metrics-server resource to run alongside your application. As of Kubernetes v1.6, it is possible to autoscale based on custom metrics.

Custom metrics are user defined and are collected from within the cluster. As of Kubernetes v1.10, support for external metrics was introduced so you can autoscale based on any metric from outside the cluster that is collected by Datadog.

A user must implement and register the Custom Metrics Server and External Metrics Provider.

As of v1.0.0, the Custom Metrics Server in the Datadog Cluster Agent implements the External Metrics Provider interface for external metrics. This page explains how to set it up and how to autoscale your Kubernetes workload based on your Datadog metrics.

Setup

Requirements

  1. Kubernetes v1.10+: you must register the External Metrics Provider resource against the API server.
  2. Enable the Kubernetes aggregation layer.

Installation

To enable the external metrics server with your Cluster Agent in Helm, update your datadog-values.yaml file with the following Cluster Agent configuration. After you set clusterAgent.metricsProvider.enabled to true, redeploy your Datadog Helm chart:

clusterAgent:
  enabled: true
  # Enable the metricsProvider to be able to scale based on metrics in Datadog
  metricsProvider:
    # clusterAgent.metricsProvider.enabled
    # Set this to true to enable Metrics Provider
    enabled: true

This automatically updates the necessary RBAC configurations and sets up the corresponding Service and APIService for Kubernetes to use.
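For example, assuming a Helm release named datadog installed from the datadog/datadog chart (both names are assumptions; substitute your own), the redeploy looks like:

```shell
# Redeploy the chart so the metricsProvider change takes effect.
helm upgrade datadog datadog/datadog -f datadog-values.yaml
```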

To enable the external metrics server with your Cluster Agent managed by the Datadog Operator, first set up the Datadog Operator. Then, set clusterAgent.config.externalMetrics.enabled to true in the DatadogAgent custom resource:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiKey: <DATADOG_API_KEY>
clusterAgent:
  config:
    externalMetrics:
      enabled: true
  replicas: 2

The Operator automatically updates the necessary RBAC configurations and sets the corresponding Service and APIService for Kubernetes to use.
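With the manifest above saved as, say, datadog-agent.yaml (a hypothetical filename), apply it and confirm the resource was created:

```shell
# Apply the DatadogAgent resource; the Operator reconciles it.
kubectl apply -f datadog-agent.yaml
# Confirm the custom resource exists.
kubectl get datadogagent datadog
```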

Custom metrics server

To enable the Custom Metrics Server, first follow the instructions to set up the Datadog Cluster Agent within your cluster. Once you have verified a successful base deployment, edit your Deployment manifest for the Datadog Cluster Agent with the following steps:

  1. Set DD_EXTERNAL_METRICS_PROVIDER_ENABLED environment variable to true.
  2. Ensure you have both your environment variables DD_APP_KEY and DD_API_KEY set.
  3. Ensure you have your DD_SITE environment variable set to your Datadog site. It defaults to the US site datadoghq.com.
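The three steps above can be sketched as an env section in the Cluster Agent container spec (the Secret name datadog-secret and its key names are assumptions):

```yaml
env:
  - name: DD_EXTERNAL_METRICS_PROVIDER_ENABLED
    value: "true"
  - name: DD_API_KEY
    valueFrom:
      secretKeyRef:
        name: datadog-secret   # hypothetical Secret holding your keys
        key: api-key
  - name: DD_APP_KEY
    valueFrom:
      secretKeyRef:
        name: datadog-secret
        key: app-key
  - name: DD_SITE
    value: "datadoghq.com"     # change to your Datadog site if needed
```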

Register the external metrics provider service

Once the Datadog Cluster Agent is up and running, apply some additional RBAC policies and set up the Service to route the corresponding requests.

  1. Create a Service named datadog-custom-metrics-server, exposing the port 8443 with the following custom-metric-server.yaml manifest:

    kind: Service
    apiVersion: v1
    metadata:
      name: datadog-custom-metrics-server
    spec:
      selector:
        app: datadog-cluster-agent
      ports:
      - protocol: TCP
        port: 8443
        targetPort: 8443
    

    Note: By default, the Cluster Agent expects these requests over port 8443. However, if your Cluster Agent Deployment sets the environment variable DD_EXTERNAL_METRICS_PROVIDER_PORT to some other port value, change the targetPort of this Service accordingly.

    Apply this Service by running kubectl apply -f custom-metric-server.yaml.

  2. Download the rbac-hpa.yaml RBAC rules file.

  3. Register the Cluster Agent as an external metrics provider by applying this file:

    kubectl apply -f rbac-hpa.yaml
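    As a sanity check (not part of the official steps), you can verify that the external metrics APIService is registered and answering:

    ```shell
    # The APIService should report Available=True once the Cluster Agent serves it.
    kubectl get apiservice v1beta1.external.metrics.k8s.io
    # Query the external metrics API directly through the aggregation layer.
    kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"
    ```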
    

Usage

Once you have the Datadog Cluster Agent running and the service registered, create an HPA manifest and specify type: External for your metrics in order to notify the HPA to pull the metrics from the Datadog Cluster Agent’s service:

spec:
  metrics:
    - type: External
      external:
        metricName: "<METRIC_NAME>"
        metricSelector:
          matchLabels:
            <TAG_KEY>: <TAG_VALUE>

Example HPAs

An HPA manifest to autoscale an NGINX deployment based on the nginx.net.request_per_s Datadog metric using apiVersion: autoscaling/v2beta1:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: nginxext
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  metrics:
  - type: External
    external:
      metricName: nginx.net.request_per_s
      metricSelector:
        matchLabels:
          kube_container_name: nginx
      targetAverageValue: 9

Note: In this manifest:

  • The HPA is configured to autoscale the deployment called nginx.
  • The maximum number of replicas created is 3, and the minimum is 1.
  • The metric used is nginx.net.request_per_s, and the scope is kube_container_name: nginx. The metric name and tag scope use the same format as in Datadog.

The following is the same HPA manifest as above using apiVersion: autoscaling/v2:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginxext
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  metrics:
  - type: External
    external:
      metric:
        name: nginx.net.request_per_s
      target:
        type: AverageValue
        averageValue: 9

Every 30 seconds, Kubernetes queries the Datadog Cluster Agent to get the value of this metric and autoscales proportionally if necessary. For advanced use cases, it is possible to have several metrics in the same HPA. As noted in Kubernetes horizontal pod autoscaling, the largest of the proposed values is the one chosen.

Note: Running multiple Cluster Agents raises API usage. The Datadog Cluster Agent completes 120 calls per hour for approximately 45 HPA objects in Kubernetes. Running more than 45 HPAs increases the number of calls when fetching metrics from within the same org.
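To observe the HPA reading the external metric, you can inspect it with kubectl (nginxext is the example HPA name from the manifests above):

```shell
# Current metric value, target, and replica count.
kubectl get hpa nginxext
# Detailed conditions and events, useful when the metric is not being fetched.
kubectl describe hpa nginxext
```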

Autoscaling

You can autoscale on a Datadog query by using the DatadogMetric Custom Resource Definition (CRD) and Datadog Cluster Agent version 1.7.0 or above. This is a more flexible approach that allows you to scale with the exact Datadog query you would use in-app.

Requirements

For autoscaling to work correctly, custom queries must follow these rules:

  • The query must be syntactically correct, otherwise it prevents the refresh of ALL metrics used for autoscaling (effectively stopping autoscaling).
  • The query result must output only one series (otherwise, the results are considered invalid).
  • The query should yield at least two timestamped points (it’s possible to use a query that returns a single point, though in this case, autoscaling may use incomplete points).

Note: While the query is arbitrary, the start and end times are still set at Now() - 5 minutes and Now().

Setup

Datadog Cluster Agent

Set up the Datadog Cluster Agent to use DatadogMetric using Helm, the Datadog Operator, or a DaemonSet:

To activate usage of the DatadogMetric CRD, update your datadog-values.yaml Helm configuration to set clusterAgent.metricsProvider.useDatadogMetrics to true. Then redeploy your Datadog Helm chart:

clusterAgent:
  enabled: true
  metricsProvider:
    enabled: true
    # clusterAgent.metricsProvider.useDatadogMetrics
    # Enable usage of DatadogMetric CRD to autoscale on arbitrary Datadog queries
    useDatadogMetrics: true

Note: This attempts to install the DatadogMetric CRD automatically. If that CRD already exists prior to the initial Helm installation, it may conflict.

This automatically updates the necessary RBAC files and directs the Cluster Agent to manage these HPA queries through these DatadogMetric resources.

To activate usage of the DatadogMetric CRD, update your DatadogAgent custom resource and set clusterAgent.config.externalMetrics.useDatadogMetrics to true:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiKey: <DATADOG_API_KEY>
clusterAgent:
  config:
    externalMetrics:
      enabled: true
      useDatadogMetrics: true
  replicas: 2

The Operator automatically updates the necessary RBAC configurations and directs the Cluster Agent to manage these HPA queries through these DatadogMetric resources.

To activate usage of the DatadogMetric CRD, follow these extra steps:

  1. Install the DatadogMetric CRD in your cluster.

    kubectl apply -f "https://raw.githubusercontent.com/DataDog/helm-charts/master/crds/datadoghq.com_datadogmetrics.yaml"
    
  2. Update the Datadog Cluster Agent RBAC manifest to allow usage of the DatadogMetric CRD.

    kubectl apply -f "https://raw.githubusercontent.com/DataDog/datadog-agent/master/Dockerfiles/manifests/cluster-agent-datadogmetrics/cluster-agent-rbac.yaml"
    
  3. Set the DD_EXTERNAL_METRICS_PROVIDER_USE_DATADOGMETRIC_CRD environment variable to true in the deployment of the Datadog Cluster Agent.
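Step 3 amounts to adding one environment variable to the Cluster Agent container spec:

```yaml
env:
  - name: DD_EXTERNAL_METRICS_PROVIDER_USE_DATADOGMETRIC_CRD
    value: "true"
```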

HPA

Once the Cluster Agent is set up, configure an HPA to use the DatadogMetric object. DatadogMetric is a namespaced resource. While any HPA can reference any DatadogMetric, Datadog recommends creating them in the same namespace as your HPA.

Note: Multiple HPAs can use the same DatadogMetric.

You can create a DatadogMetric with the following manifest:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogMetric
metadata:
  name: <your_datadogmetric_name>
spec:
  query: <your_custom_query>

Example DatadogMetric object

A DatadogMetric object to autoscale an NGINX deployment based on the nginx.net.request_per_s Datadog metric:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogMetric
metadata:
  name: nginx-requests
spec:
  query: max:nginx.net.request_per_s{kube_container_name:nginx}.rollup(60)

Once your DatadogMetric is created, you need to configure your HPA to use this DatadogMetric:

spec:
  metrics:
    - type: External
      external:
        metricName: "datadogmetric@<namespace>:<datadogmetric_name>"

Example HPAs

An HPA using the DatadogMetric named nginx-requests, assuming both objects are in namespace nginx-demo.

Using apiVersion: autoscaling/v2beta1:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: nginxext
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  metrics:
  - type: External
    external:
      metricName: datadogmetric@nginx-demo:nginx-requests
      targetAverageValue: 9

Using apiVersion: autoscaling/v2:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginxext
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  metrics:
  - type: External
    external:
      metric:
        name: datadogmetric@nginx-demo:nginx-requests
      target:
        type: AverageValue
        averageValue: 9

Once you’ve linked your HPA to a DatadogMetric, the Datadog Cluster Agent uses your custom query to provide values to your HPA.

Migration

Existing HPAs that use external metrics are automatically migrated.

When you set DD_EXTERNAL_METRICS_PROVIDER_USE_DATADOGMETRIC_CRD to true but you still have HPAs that do not reference a DatadogMetric, normal syntax (without referencing a DatadogMetric through datadogmetric@...) is still supported.

The Datadog Cluster Agent automatically creates DatadogMetric resources in its own namespace (their names start with dcaautogen-) to accommodate this, allowing a smooth transition to DatadogMetric.

If you choose to migrate an HPA later on to reference a DatadogMetric, the automatically generated resource is cleaned up by the Datadog Cluster Agent after a few hours.
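To list the autogenerated resources, query the namespace your Cluster Agent runs in:

```shell
# Autogenerated DatadogMetric objects are prefixed with dcaautogen-.
kubectl get datadogmetric -n <CLUSTER_AGENT_NAMESPACE>
```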

Troubleshooting

The Datadog Cluster Agent updates the status subresource of all DatadogMetric resources to reflect results from queries to Datadog. This is the main source of information to understand what is happening if something fails. Run the following to view this information:

kubectl describe datadogmetric <RESOURCE NAME>

Example

The status part of a DatadogMetric:

status:
  conditions:
  - lastTransitionTime: "2020-06-22T14:38:21Z"
    lastUpdateTime: "2020-06-25T09:21:00Z"
    status: "True"
    type: Active
  - lastTransitionTime: "2020-06-25T09:00:00Z"
    lastUpdateTime: "2020-06-25T09:21:00Z"
    status: "True"
    type: Valid
  - lastTransitionTime: "2020-06-22T14:38:21Z"
    lastUpdateTime: "2020-06-25T09:21:00Z"
    status: "True"
    type: Updated
  - lastTransitionTime: "2020-06-25T09:00:00Z"
    lastUpdateTime: "2020-06-25T09:21:00Z"
    status: "False"
    type: Error
  currentValue: "1977.2"

The four conditions give you insights on the current state of your DatadogMetric:

  • Active: Datadog considers a DatadogMetric active if at least one HPA is referencing it. Inactive DatadogMetrics are not updated to minimize API usage.
  • Valid: Datadog considers a DatadogMetric valid when the answer for the associated query is valid. An invalid status probably means that your custom query is not semantically correct. See the Error field for details.
  • Updated: This condition is always updated when the Datadog Cluster Agent touches a DatadogMetric.
  • Error: If processing this DatadogMetric triggers an error, this condition is true and contains error details.

The currentValue is the value gathered from Datadog and returned to the HPAs.

Further Reading