
This feature is not available for the Datadog for Government (US1-FED) site.

Datadog Kubernetes Autoscaling continuously monitors your Kubernetes resources to provide immediate scaling recommendations and multidimensional autoscaling of your Kubernetes workloads. You can deploy autoscaling through the Datadog web interface, or with a DatadogPodAutoscaler custom resource.

How it works

Datadog uses real-time and historical utilization metrics and event signals from your existing Datadog Agents to make recommendations. You can then examine these recommendations and choose to deploy them.

By default, Datadog Kubernetes Autoscaling uses estimated CPU and memory cost values to show savings opportunities and impact estimates. You can also use Kubernetes Autoscaling alongside Cloud Cost Management to get reporting based on your exact instance type costs.

Automated workload scaling is powered by a DatadogPodAutoscaler custom resource that defines scaling behavior on a per-workload level. The Datadog Cluster Agent acts as the controller for this custom resource.

Each cluster can have a maximum of 1000 workloads optimized with Datadog Kubernetes Autoscaling.

Compatibility

  • Distributions: This feature is compatible with all of Datadog’s supported Kubernetes distributions.
  • Workload autoscaling: This feature is an alternative to the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA). Datadog recommends removing any HPAs or VPAs from a workload before enabling Datadog Kubernetes Autoscaling to optimize it; workloads that still have an HPA or VPA attached are flagged for you in the application. Note: You can experiment with Datadog Kubernetes Autoscaling while keeping your HPA and/or VPA by creating a DatadogPodAutoscaler with mode: Preview in the applyPolicy section, as sketched below.
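
For reference, a minimal Preview-mode resource might look like the following sketch, assuming a Deployment target; the placeholder names and the constraint and objective values are illustrative only. In this mode, Datadog computes recommendations without applying any changes:

apiVersion: datadoghq.com/v1alpha2
kind: DatadogPodAutoscaler
metadata:
    name: <WORKLOAD_NAME>
    namespace: <NAMESPACE>
spec:
    targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: <WORKLOAD_NAME>
    owner: Local
    applyPolicy:
        # Preview: recommendations are generated and visible in Datadog, but never applied
        mode: Preview
    constraints:
        maxReplicas: 100
        minReplicas: 2
    objectives:
        - type: PodResource
          podResource:
            name: cpu
            value:
                type: Utilization
                utilization: 70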

Requirements

  • Remote Configuration must be enabled both at the organization level and on the Agents in your target cluster. See Enabling Remote Configuration for setup instructions.

  • Helm, for updating your Datadog Agent.

  • (For Datadog Operator users) kubectl CLI, for updating the Datadog Agent.

  • For live autoscaling, Datadog recommends using the latest Datadog Agent version to ensure access to the latest improvements and optimizations. Scaling recommendations require the Kubernetes State Core integration to be enabled.

    Feature                                        Minimum Agent Version
    In-app workload scaling recommendations        7.50+
    Live workload scaling                          7.66.1+
    Argo Rollout recommendations and autoscaling   7.71+
    Cluster autoscaling (preview sign-up)          7.72+
  • The following user permissions:

    • Org Management (required for Remote Configuration)
    • API Keys Write (required for Remote Configuration)
    • Workload Scaling Write
    • Autoscaling Manage
  • (Recommended) Linux kernel v5.19+ and cgroup v2
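
To check whether a node meets these recommendations, you can inspect the kernel and cgroup setup directly on the node; these are standard Linux commands, not Datadog tooling:

# Kernel version: look for 5.19 or later
uname -r

# cgroup version: "cgroup2fs" indicates cgroup v2, "tmpfs" indicates cgroup v1
stat -fc %T /sys/fs/cgroup/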

Setup

If you deploy the Datadog Agent with the Datadog Operator:

  1. Ensure you are using Datadog Operator v1.16.0+. To upgrade your Datadog Operator:
helm upgrade datadog-operator datadog/datadog-operator
  2. Add the following to your datadog-agent.yaml configuration file:
spec:
  features:
    autoscaling:
      workload:
        enabled: true
    eventCollection:
      unbundleEvents: true
  override:
    clusterAgent:
      env:
        - name: DD_AUTOSCALING_FAILOVER_ENABLED
          value: "true"
    nodeAgent:
      env:
        - name: DD_AUTOSCALING_FAILOVER_ENABLED
          value: "true"
  3. Admission Controller is enabled by default with the Datadog Operator. If you disabled it, re-enable it by adding the following lines to datadog-agent.yaml:
...
spec:
  features:
    admissionController:
      enabled: true
...
  4. Apply the updated datadog-agent.yaml configuration:
kubectl apply -n $DD_NAMESPACE -f datadog-agent.yaml

If you deploy the Datadog Agent with Helm:

  1. Ensure you are using Agent and Cluster Agent v7.66.1+. Add the following to your datadog-values.yaml configuration file:
datadog:
  autoscaling:
    workload:
      enabled: true
  kubernetesEvents:
    unbundleEvents: true
  2. Admission Controller is enabled by default in the Datadog Helm chart. If you disabled it, re-enable it by adding the following lines to datadog-values.yaml:
...
clusterAgent:
  admissionController:
    enabled: true
...
  3. Update your local Helm chart repositories:
helm repo update
  4. Redeploy the Datadog Agent with your updated datadog-values.yaml:
helm upgrade -f datadog-values.yaml <RELEASE_NAME> datadog/datadog

Idle cost and savings estimates

If Cloud Cost Management is enabled within an org, Datadog Kubernetes Autoscaling shows idle cost and savings estimates based on the exact billed cost of the underlying monitored instances.

See Cloud Cost setup instructions for AWS, Azure, or Google Cloud.

Cloud Cost Management data enhances Kubernetes Autoscaling, but it is not required. All of Datadog’s workload recommendations and autoscaling decisions are valid and functional without Cloud Cost Management.

If Cloud Cost Management is not enabled, Datadog Kubernetes Autoscaling shows idle cost and savings estimates using the following formulas and fixed values:

Cluster idle:

  (cpu_capacity - max(cpu_usage, cpu_requests)) * core_rate_per_hour
+ (mem_capacity - max(mem_usage, mem_requests)) * memory_rate_per_hour

Workload idle:

  (max(cpu_usage, cpu_requests) - cpu_usage) * core_rate_per_hour
+ (max(mem_usage, mem_requests) - mem_usage) * memory_rate_per_hour

Fixed values:

  • core_rate_per_hour = $0.0295 per CPU core hour
  • memory_rate_per_hour = $0.0053 per GB of memory per hour

Fixed cost values are subject to refinement over time.
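
As a worked example with hypothetical figures (a cluster with 16 CPU cores and 64 GB of memory of capacity, where the larger of usage and requests comes to 10 cores and 40 GB), the cluster idle estimate is:

  (16 - 10) * $0.0295 + (64 - 40) * $0.0053
= $0.177 + $0.1272
≈ $0.30 per hour, or roughly $220 per month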

Usage

Identify resources to rightsize

The Autoscaling Summary page provides a starting point for platform teams to understand the total Kubernetes resource savings opportunities across an organization, and to filter down to key clusters and namespaces. The Cluster Scaling view provides per-cluster information about total idle CPU, total idle memory, and costs. Click a cluster for detailed information and a table of the cluster’s workloads. If you are an individual application or service owner, you can also filter by your team or service name directly from the Workload Scaling list view.

Click Optimize on any workload to see its scaling recommendation.

Enable Autoscaling for a workload

After you identify a workload to optimize, Datadog recommends inspecting its Scaling Recommendation. You can also click Configure Recommendation to add constraints or adjust target utilization levels.

When you are ready to proceed with enabling Autoscaling for a workload, you have two options for deployment:

  • Click Enable Autoscaling. (Requires Workload Scaling Write permission.)

    Datadog automatically installs and configures autoscaling for this workload on your behalf.

  • Deploy a DatadogPodAutoscaler custom resource.

    Use your existing deploy process to target and configure Autoscaling for your workload. See the example configurations below.

Example DatadogPodAutoscaler configurations

The following examples demonstrate common DatadogPodAutoscaler configurations for different scaling strategies. You can use these as starting points and adjust the values to match your workload’s requirements.

The Optimize Cost profile uses multidimensional scaling to aggressively reduce resource waste. It sets a high CPU utilization target (85%), allows scaling down to a single replica, and uses aggressive scale-down rules for fast response to reduced load.

apiVersion: datadoghq.com/v1alpha2
kind: DatadogPodAutoscaler
metadata:
    name: <WORKLOAD_NAME>
    namespace: <NAMESPACE>
spec:
    targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: <WORKLOAD_NAME>
    owner: Local
    applyPolicy:
        mode: Apply
        scaleDown:
            rules:
                # Aggressive: allow 50% reduction every 2 minutes
                - periodSeconds: 120
                  type: Percent
                  value: 50
            stabilizationWindowSeconds: 300
            strategy: Max
        scaleUp:
            rules:
                - periodSeconds: 120
                  type: Percent
                  value: 50
            stabilizationWindowSeconds: 300
            strategy: Max
        update:
            strategy: Auto
    constraints:
        containers:
            - enabled: true
              name: '*'
        maxReplicas: 100
        # Allow scaling down to 1 replica for maximum savings
        minReplicas: 1
    objectives:
        # High utilization target to maximize cost efficiency
        - type: PodResource
          podResource:
            name: cpu
            value:
                type: Utilization
                utilization: 85
    fallback:
        horizontal:
            direction: ScaleUp
            enabled: true
            objectives:
                - type: PodResource
                  podResource:
                    name: cpu
                    value:
                        type: Utilization
                        utilization: 85
            triggers:
                staleRecommendationThresholdSeconds: 600

The Optimize Balance profile provides a middle ground between cost optimization and stability. It uses a moderate CPU utilization target (70%), maintains at least 2 replicas, and applies conservative scale-down rules to avoid disruptive scaling events.

apiVersion: datadoghq.com/v1alpha2
kind: DatadogPodAutoscaler
metadata:
    name: <WORKLOAD_NAME>
    namespace: <NAMESPACE>
spec:
    targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: <WORKLOAD_NAME>
    owner: Local
    applyPolicy:
        mode: Apply
        scaleDown:
            rules:
                # Conservative: allow only 20% reduction every 20 minutes
                - periodSeconds: 1200
                  type: Percent
                  value: 20
            stabilizationWindowSeconds: 600
            strategy: Max
        scaleUp:
            rules:
                - periodSeconds: 120
                  type: Percent
                  value: 50
            stabilizationWindowSeconds: 600
            strategy: Max
        update:
            strategy: Auto
    constraints:
        containers:
            - enabled: true
              name: '*'
        maxReplicas: 100
        # Maintain at least 2 replicas for availability
        minReplicas: 2
    objectives:
        # Moderate utilization target balances cost and performance
        - type: PodResource
          podResource:
            name: cpu
            value:
                type: Utilization
                utilization: 70
    fallback:
        horizontal:
            direction: ScaleUp
            enabled: true
            objectives:
                - type: PodResource
                  podResource:
                    name: cpu
                    value:
                        type: Utilization
                        utilization: 70
            triggers:
                staleRecommendationThresholdSeconds: 600

The Vertical only profile scales by adjusting CPU and memory requests and limits on existing pods, without changing the replica count. This is useful for workloads that cannot be horizontally scaled, or where you want to rightsize individual pods.

apiVersion: datadoghq.com/v1alpha2
kind: DatadogPodAutoscaler
metadata:
    name: <WORKLOAD_NAME>
    namespace: <NAMESPACE>
spec:
    targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: <WORKLOAD_NAME>
    owner: Local
    applyPolicy:
        mode: Apply
        # Horizontal scaling disabled; only vertical resizing
        scaleDown:
            strategy: Disabled
        scaleUp:
            strategy: Disabled
        update:
            strategy: Auto
    constraints:
        containers:
            - enabled: true
              name: '*'
        maxReplicas: 100
        minReplicas: 2
    # No objectives defined — Datadog recommends CPU and memory
    # values based on observed usage patterns
    fallback:
        horizontal:
            direction: ScaleUp
            # Horizontal fallback disabled for vertical-only scaling
            enabled: false
            objectives:
                - type: PodResource
                  podResource:
                    name: cpu
                    value:
                        type: Utilization
                        utilization: 70
            triggers:
                staleRecommendationThresholdSeconds: 600

The Horizontal only with Custom Query profile scales replica count based on a custom Datadog metric query instead of CPU or memory utilization. This is useful for workloads where application-level metrics (such as queue depth, request latency, or throughput) are better indicators of scaling need.

apiVersion: datadoghq.com/v1alpha2
kind: DatadogPodAutoscaler
metadata:
    name: <WORKLOAD_NAME>
    namespace: <NAMESPACE>
spec:
    targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: <WORKLOAD_NAME>
    owner: Local
    applyPolicy:
        mode: Apply
        scaleDown:
            rules:
                - periodSeconds: 1200
                  type: Percent
                  value: 20
            stabilizationWindowSeconds: 600
            strategy: Min
        scaleUp:
            rules:
                - periodSeconds: 120
                  type: Percent
                  value: 50
            stabilizationWindowSeconds: 600
            strategy: Min
        # Vertical updates disabled — horizontal only
        update:
            strategy: Disabled
    constraints:
        maxReplicas: 100
        minReplicas: 2
    objectives:
        - type: CustomQuery
          customQuery:
            # Replace with your own Datadog metric query
            request:
                formula: usage
                queries:
                    - name: usage
                      source: Metrics
                      metrics:
                        query: avg:redis.info.latency_ms{kube_cluster_name:<CLUSTER_NAME>,kube_namespace:<NAMESPACE>,kube_deployment:<WORKLOAD_NAME>}
            value:
                type: AbsoluteValue
                absoluteValue: 500M
            window: 5m0s
    fallback:
        horizontal:
            direction: ScaleUp
            enabled: false
            objectives:
                - type: PodResource
                  podResource:
                    name: cpu
                    value:
                        type: Utilization
                        utilization: 70
            triggers:
                staleRecommendationThresholdSeconds: 600

Deploy recommendations manually

As an alternative to Autoscaling, you can also deploy Datadog’s scaling recommendations manually. When you configure resources for your Kubernetes deployments, use the values suggested in the scaling recommendations. You can also click Export Recommendation to see a generated kubectl patch command.
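
The exact command generated by Export Recommendation depends on your workload. As a rough sketch of what a manual patch can look like, with placeholder resource values rather than Datadog's actual recommendation, a strategic merge patch on a Deployment might be:

# Hypothetical example: update the CPU and memory requests of one container.
# Substitute the container name and the values from your scaling recommendation.
kubectl patch deployment <WORKLOAD_NAME> -n <NAMESPACE> \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"<CONTAINER_NAME>","resources":{"requests":{"cpu":"500m","memory":"256Mi"}}}]}}}}'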

Further reading