This feature is not available for the Datadog for Government (US1-FED) site.
Datadog Kubernetes Autoscaling continuously monitors your Kubernetes resources to provide immediate scaling recommendations and multidimensional autoscaling of your Kubernetes workloads. You can deploy autoscaling through the Datadog web interface, or with a DatadogPodAutoscaler custom resource.
How it works
Datadog uses real-time and historical utilization metrics and event signals from your existing Datadog Agents to make recommendations. You can then examine these recommendations and choose to deploy them.
By default, Datadog Kubernetes Autoscaling uses estimated CPU and memory cost values to show savings opportunities and impact estimates. You can also use Kubernetes Autoscaling alongside Cloud Cost Management to get reporting based on your exact instance type costs.
Automated workload scaling is powered by a DatadogPodAutoscaler custom resource that defines scaling behavior on a per-workload level. The Datadog Cluster Agent acts as the controller for this custom resource.
Each cluster can have a maximum of 1000 workloads optimized with Datadog Kubernetes Autoscaling.
Compatibility
- Distributions: This feature is compatible with all of Datadog’s supported Kubernetes distributions.
- Workload autoscaling: This feature is an alternative to Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA). Datadog recommends that you remove any HPAs or VPAs from a workload when enabling Datadog Kubernetes Autoscaling to optimize it. These workloads are identified in the application on your behalf.
Note: You can experiment with Datadog Kubernetes Autoscaling while keeping your HPA and/or VPA by creating a DatadogPodAutoscaler with mode: Preview in the applyPolicy section.
Requirements
Remote Configuration must be enabled both at the organization level and on the Agents in your target cluster. See Enabling Remote Configuration for setup instructions.
Helm, for updating your Datadog Agent.
(For Datadog Operator users) kubectl CLI, for updating the Datadog Agent.
When you are using live autoscaling, Datadog recommends using the latest Datadog Agent version. This helps ensure access to the latest improvements and optimizations. Scaling recommendations require the Kubernetes State Core integration to be enabled.
| Feature | Minimum Agent Version |
|---|---|
| In-app workload scaling recommendations | 7.50+ |
| Live workload scaling | 7.66.1+ |
| Argo Rollout recommendations and autoscaling | 7.71+ |
| Cluster autoscaling (preview sign-up) | 7.72+ |
The following user permissions:
- Org Management (required for Remote Configuration)
- API Keys Write (required for Remote Configuration)
- Workload Scaling Write
- Autoscaling Manage
(Recommended) Linux kernel v5.19+ and cgroup v2
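The minimum Agent versions in the table above can be checked programmatically. The following is a minimal sketch, not a Datadog tool: the helper names `parse_version` and `supported_features` are hypothetical, and it assumes plain numeric version strings such as `7.66.1` (pre-release suffixes are not handled).

```python
# Minimum Agent versions copied from the requirements table.
MIN_AGENT_VERSION = {
    "In-app workload scaling recommendations": (7, 50, 0),
    "Live workload scaling": (7, 66, 1),
    "Argo Rollout recommendations and autoscaling": (7, 71, 0),
    "Cluster autoscaling (preview sign-up)": (7, 72, 0),
}


def parse_version(text):
    """Turn '7.66.1' or 'v7.72' into a 3-part tuple of ints (hypothetical helper)."""
    parts = [int(p) for p in text.strip().lstrip("v").split(".")]
    parts += [0] * (3 - len(parts))  # pad '7.72' to (7, 72, 0)
    return tuple(parts[:3])


def supported_features(agent_version):
    """List the features whose minimum version is met by agent_version."""
    version = parse_version(agent_version)
    return [name for name, minimum in MIN_AGENT_VERSION.items() if version >= minimum]


if __name__ == "__main__":
    for feature in supported_features("7.66.1"):
        print(feature)
```

Tuples compare element by element, so `(7, 66, 1) >= (7, 66, 1)` holds while `(7, 66, 0)` falls short of the live workload scaling requirement.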
Setup
- Ensure you are using Datadog Operator v1.16.0+. To upgrade your Datadog Operator:
helm upgrade datadog-operator datadog/datadog-operator
- Add the following to your datadog-agent.yaml configuration file:
spec:
  features:
    autoscaling:
      workload:
        enabled: true
    eventCollection:
      unbundleEvents: true
  override:
    clusterAgent:
      env:
        - name: DD_AUTOSCALING_FAILOVER_ENABLED
          value: "true"
    nodeAgent:
      env:
        - name: DD_AUTOSCALING_FAILOVER_ENABLED
          value: "true"
- Admission Controller is enabled by default with the Datadog Operator. If you disabled it, re-enable it by adding the following lines to datadog-agent.yaml:
...
spec:
  features:
    admissionController:
      enabled: true
...
- Apply the updated datadog-agent.yaml configuration:
kubectl apply -n $DD_NAMESPACE -f datadog-agent.yaml
- Ensure you are using Agent and Cluster Agent v7.66.1+. Add the following to your datadog-values.yaml configuration file:
datadog:
  autoscaling:
    workload:
      enabled: true
  kubernetesEvents:
    unbundleEvents: true
- Admission Controller is enabled by default in the Datadog Helm chart. If you disabled it, re-enable it by adding the following lines to datadog-values.yaml:
...
clusterAgent:
  admissionController:
    enabled: true
...
- Update your Helm version:
- Redeploy the Datadog Agent with your updated datadog-values.yaml:
helm upgrade -f datadog-values.yaml <RELEASE_NAME> datadog/datadog
Idle cost and savings estimates
If Cloud Cost Management is enabled for your organization, Datadog Kubernetes Autoscaling shows idle cost and savings estimates based on the exact billed cost of the underlying monitored instances.
See Cloud Cost setup instructions for AWS, Azure, or Google Cloud.
Cloud Cost Management data enhances Kubernetes Autoscaling, but it is not required. All of Datadog’s workload recommendations and autoscaling decisions are valid and functional without Cloud Cost Management.
If Cloud Cost Management is not enabled, Datadog Kubernetes Autoscaling shows idle cost and savings estimates using the following formulas and fixed values:
Cluster idle:
(cpu_capacity - max(cpu_usage, cpu_requests)) * core_rate_per_hour
+ (mem_capacity - max(mem_usage, mem_requests)) * memory_rate_per_hour
Workload idle:
(max(cpu_usage, cpu_requests) - cpu_usage) * core_rate_per_hour
+ (max(mem_usage, mem_requests) - mem_usage) * memory_rate_per_hour
Fixed values:
- core_rate_per_hour = $0.0295 per CPU core hour
- memory_rate_per_hour = $0.0053 per memory GB hour
Fixed cost values are subject to refinement over time.
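As a worked illustration of the two formulas, the sketch below evaluates them with the fixed rates listed above. The function names and the sample capacity, usage, and request figures are hypothetical; CPU values are assumed to be in cores and memory values in GB, matching the units of the rates.

```python
# Fixed rates from the list above (USD per core hour and per GB hour).
CORE_RATE_PER_HOUR = 0.0295
MEMORY_RATE_PER_HOUR = 0.0053


def cluster_idle_cost(cpu_capacity, cpu_usage, cpu_requests,
                      mem_capacity, mem_usage, mem_requests):
    """Hourly cost of capacity that is neither used nor requested."""
    cpu_idle = cpu_capacity - max(cpu_usage, cpu_requests)
    mem_idle = mem_capacity - max(mem_usage, mem_requests)
    return cpu_idle * CORE_RATE_PER_HOUR + mem_idle * MEMORY_RATE_PER_HOUR


def workload_idle_cost(cpu_usage, cpu_requests, mem_usage, mem_requests):
    """Hourly cost of resources a workload requests but does not use."""
    cpu_idle = max(cpu_usage, cpu_requests) - cpu_usage
    mem_idle = max(mem_usage, mem_requests) - mem_usage
    return cpu_idle * CORE_RATE_PER_HOUR + mem_idle * MEMORY_RATE_PER_HOUR


if __name__ == "__main__":
    # Hypothetical cluster: 64 cores / 256 GB capacity.
    cluster = cluster_idle_cost(64, 20, 32, 256, 100, 128)
    # Hypothetical workload: requests 4 cores / 8 GB, uses 1.5 cores / 3 GB.
    workload = workload_idle_cost(1.5, 4, 3, 8)
    print(f"cluster idle:  ${cluster:.4f} per hour")
    print(f"workload idle: ${workload:.4f} per hour")
```

Because both formulas take `max(usage, requests)`, a workload that uses more than it requests contributes zero workload idle cost rather than a negative value.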
Usage
Identify resources to rightsize
The Autoscaling Summary page gives platform teams a starting point for understanding the total Kubernetes resource savings opportunities across an organization, and for filtering down to key clusters and namespaces. The Cluster Scaling view provides per-cluster information about total idle CPU, total idle memory, and costs. Click a cluster for detailed information and a table of the cluster’s workloads. If you are an individual application or service owner, you can also filter by your team or service name directly from the Workload Scaling list view.
Click Optimize on any workload to see its scaling recommendation.
Enable Autoscaling for a workload
After you identify a workload to optimize, Datadog recommends inspecting its Scaling Recommendation. You can also click Configure Recommendation to add constraints or adjust target utilization levels.
When you are ready to proceed with enabling Autoscaling for a workload, you have two options for deployment:
- Click Enable Autoscaling. (Requires Workload Scaling Write permission.) Datadog automatically installs and configures autoscaling for this workload on your behalf.
- Deploy a DatadogPodAutoscaler custom resource. Use your existing deploy process to target and configure Autoscaling for your workload. See the example configurations below.
Example DatadogPodAutoscaler configurations
The following examples demonstrate common DatadogPodAutoscaler configurations for different scaling strategies. You can use these as starting points and adjust the values to match your workload’s requirements.
The Optimize Cost profile uses multidimensional scaling to aggressively reduce resource waste. It sets a high CPU utilization target (85%), allows scaling down to a single replica, and uses aggressive scale-down rules for fast response to reduced load.
apiVersion: datadoghq.com/v1alpha2
kind: DatadogPodAutoscaler
metadata:
  name: <WORKLOAD_NAME>
  namespace: <NAMESPACE>
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <WORKLOAD_NAME>
  owner: Local
  applyPolicy:
    mode: Apply
    scaleDown:
      rules:
        # Aggressive: allow 50% reduction every 2 minutes
        - periodSeconds: 120
          type: Percent
          value: 50
      stabilizationWindowSeconds: 300
      strategy: Max
    scaleUp:
      rules:
        - periodSeconds: 120
          type: Percent
          value: 50
      stabilizationWindowSeconds: 300
      strategy: Max
    update:
      strategy: Auto
  constraints:
    containers:
      - enabled: true
        name: '*'
    maxReplicas: 100
    # Allow scaling down to 1 replica for maximum savings
    minReplicas: 1
  objectives:
    # High utilization target to maximize cost efficiency
    - type: PodResource
      podResource:
        name: cpu
        value:
          type: Utilization
          utilization: 85
  fallback:
    horizontal:
      direction: ScaleUp
      enabled: true
      objectives:
        - type: PodResource
          podResource:
            name: cpu
            value:
              type: Utilization
              utilization: 85
      triggers:
        staleRecommendationThresholdSeconds: 600
The Optimize Balance profile provides a middle ground between cost optimization and stability. It uses a moderate CPU utilization target (70%), maintains at least 2 replicas, and applies conservative scale-down rules to avoid disruptive scaling events.
apiVersion: datadoghq.com/v1alpha2
kind: DatadogPodAutoscaler
metadata:
  name: <WORKLOAD_NAME>
  namespace: <NAMESPACE>
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <WORKLOAD_NAME>
  owner: Local
  applyPolicy:
    mode: Apply
    scaleDown:
      rules:
        # Conservative: allow only 20% reduction every 20 minutes
        - periodSeconds: 1200
          type: Percent
          value: 20
      stabilizationWindowSeconds: 600
      strategy: Max
    scaleUp:
      rules:
        - periodSeconds: 120
          type: Percent
          value: 50
      stabilizationWindowSeconds: 600
      strategy: Max
    update:
      strategy: Auto
  constraints:
    containers:
      - enabled: true
        name: '*'
    maxReplicas: 100
    # Maintain at least 2 replicas for availability
    minReplicas: 2
  objectives:
    # Moderate utilization target balances cost and performance
    - type: PodResource
      podResource:
        name: cpu
        value:
          type: Utilization
          utilization: 70
  fallback:
    horizontal:
      direction: ScaleUp
      enabled: true
      objectives:
        - type: PodResource
          podResource:
            name: cpu
            value:
              type: Utilization
              utilization: 70
      triggers:
        staleRecommendationThresholdSeconds: 600
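To get a feel for how differently the two scale-down rules above behave, the sketch below walks a replica count down one `periodSeconds` step at a time. It is an illustration only, not the controller's algorithm: it assumes a Percent rule caps each step at that percentage of the current replica count, rounds down with integer arithmetic, and ignores `stabilizationWindowSeconds`. The function names are hypothetical.

```python
def scale_down_path(start_replicas, percent, min_replicas):
    """Replica counts after each period when every step removes `percent` percent.

    Assumption: the rule applies to the current replica count and the result is
    rounded down, then clamped to min_replicas.
    """
    path = [start_replicas]
    current = start_replicas
    while current > min_replicas:
        nxt = max(min_replicas, current * (100 - percent) // 100)
        if nxt == current:  # guards against percent == 0
            break
        path.append(nxt)
        current = nxt
    return path


def seconds_to_floor(path, period_seconds):
    """Time needed to walk the whole path, one step per period."""
    return (len(path) - 1) * period_seconds


if __name__ == "__main__":
    # Optimize Cost rule: 50% every 120 seconds, minReplicas 1.
    cost = scale_down_path(100, 50, 1)
    # Optimize Balance rule: 20% every 1200 seconds, minReplicas 2.
    balance = scale_down_path(100, 20, 2)
    print(cost, seconds_to_floor(cost, 120), "seconds")
    print(balance, seconds_to_floor(balance, 1200), "seconds")
```

Under these assumptions, a workload at 100 replicas reaches its floor in minutes with the aggressive rule and in hours with the conservative one, which is the trade-off between savings and stability that the two profiles describe.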
The Vertical only profile scales by adjusting CPU and memory requests and limits on existing pods, without changing the replica count. This is useful for workloads that cannot be horizontally scaled, or where you want to rightsize individual pods.
apiVersion: datadoghq.com/v1alpha2
kind: DatadogPodAutoscaler
metadata:
  name: <WORKLOAD_NAME>
  namespace: <NAMESPACE>
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <WORKLOAD_NAME>
  owner: Local
  applyPolicy:
    mode: Apply
    # Horizontal scaling disabled; only vertical resizing
    scaleDown:
      strategy: Disabled
    scaleUp:
      strategy: Disabled
    update:
      strategy: Auto
  constraints:
    containers:
      - enabled: true
        name: '*'
    maxReplicas: 100
    minReplicas: 2
  # No objectives defined: Datadog recommends CPU and memory
  # values based on observed usage patterns
  fallback:
    horizontal:
      direction: ScaleUp
      # Horizontal fallback disabled for vertical-only scaling
      enabled: false
      objectives:
        - type: PodResource
          podResource:
            name: cpu
            value:
              type: Utilization
              utilization: 70
      triggers:
        staleRecommendationThresholdSeconds: 600
The Horizontal only with Custom Query profile scales replica count based on a custom Datadog metric query instead of CPU or memory utilization. This is useful for workloads where application-level metrics (such as queue depth, request latency, or throughput) are better indicators of scaling need.
apiVersion: datadoghq.com/v1alpha2
kind: DatadogPodAutoscaler
metadata:
  name: <WORKLOAD_NAME>
  namespace: <NAMESPACE>
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <WORKLOAD_NAME>
  owner: Local
  applyPolicy:
    mode: Apply
    scaleDown:
      rules:
        - periodSeconds: 1200
          type: Percent
          value: 20
      stabilizationWindowSeconds: 600
      strategy: Min
    scaleUp:
      rules:
        - periodSeconds: 120
          type: Percent
          value: 50
      stabilizationWindowSeconds: 600
      strategy: Min
    # Vertical updates disabled; horizontal only
    update:
      strategy: Disabled
  constraints:
    maxReplicas: 100
    minReplicas: 2
  objectives:
    - type: CustomQuery
      customQuery:
        # Replace with your own Datadog metric query
        request:
          formula: usage
          queries:
            - name: usage
              source: Metrics
              metrics:
                query: avg:redis.info.latency_ms{kube_cluster_name:<CLUSTER_NAME>,kube_namespace:<NAMESPACE>,kube_deployment:<WORKLOAD_NAME>}
        value:
          type: AbsoluteValue
          absoluteValue: 500M
        window: 5m0s
  fallback:
    horizontal:
      direction: ScaleUp
      enabled: false
      objectives:
        - type: PodResource
          podResource:
            name: cpu
            value:
              type: Utilization
              utilization: 70
      triggers:
        staleRecommendationThresholdSeconds: 600
Deploy recommendations manually
As an alternative to Autoscaling, you can also deploy Datadog’s scaling recommendations manually. When you configure resources for your Kubernetes deployments, use the values suggested in the scaling recommendations. You can also click Export Recommendation to see a generated kubectl patch command.
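For readers who script manual rollouts, the sketch below shows one way a kubectl patch command for a Deployment's container resources could be assembled. It is an assumption-laden illustration: the command that Export Recommendation generates may look different, the function name `build_patch_command` is hypothetical, and the deployment, namespace, container, and resource values are placeholders to replace with the figures from your own recommendation.

```python
import json
import shlex


def build_patch_command(deployment, namespace, container, requests, limits):
    """Return a kubectl strategic merge patch command as a shell-safe string."""
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            # Strategic merge matches containers by name.
                            "name": container,
                            "resources": {"requests": requests, "limits": limits},
                        }
                    ]
                }
            }
        }
    }
    body = json.dumps(patch, separators=(",", ":"))
    return " ".join([
        "kubectl", "patch", "deployment", shlex.quote(deployment),
        "-n", shlex.quote(namespace),
        "--type", "strategic",
        "--patch", shlex.quote(body),
    ])


if __name__ == "__main__":
    # Placeholder values; substitute the recommended requests and limits.
    print(build_patch_command(
        deployment="my-app",
        namespace="default",
        container="app",
        requests={"cpu": "250m", "memory": "512Mi"},
        limits={"cpu": "500m", "memory": "512Mi"},
    ))
```

Quoting the JSON body with `shlex.quote` keeps the printed command safe to paste into a POSIX shell.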
Further reading
Additional helpful documentation, links, and articles: