The datadog-ci `deployment gate` command includes all the required logic to evaluate Deployment Gates in a single command:

```shell
datadog-ci deployment gate --service transaction-backend --env staging
```

If the Deployment Gate being evaluated has APM Faulty Deployment Detection rules, you must also specify the version, for example `--version 1.0.1`.
The command has the following characteristics:
- It sends a request to start the evaluation and polls the evaluation status endpoint using the evaluation_id until the evaluation is complete.
- It provides a configurable timeout to determine the maximum amount of time to wait for an evaluation to complete.
- It implements automatic retries for errors.
- It allows you to customize the behavior on unexpected errors, so you can treat Datadog failures as either an evaluation pass or a fail.
Note that the `deployment gate` command is available in datadog-ci versions v3.17.0 and above.
Required environment variables:
- `DD_API_KEY`: API key used to authenticate the requests.
- `DD_APP_KEY`: Application key used to authenticate the requests.
- `DD_BETA_COMMANDS_ENABLED=1`: The `deployment gate` command is a beta command, so datadog-ci needs to be run with beta commands enabled.
For complete configuration options and detailed usage examples, refer to the `deployment gate` command documentation.
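As an illustration, a minimal sketch of a shell-based CI step follows. It assumes Node.js and npm are available in the runner and uses placeholder keys; in a real pipeline, inject the keys from your CI provider's secret store.

```shell
# Minimal sketch of a CI step (placeholder keys; adapt to your pipeline).
export DD_API_KEY="<YOUR_API_KEY>"   # from your CI secret store
export DD_APP_KEY="<YOUR_APP_KEY>"   # from your CI secret store
export DD_BETA_COMMANDS_ENABLED=1    # deployment gate is a beta command

# Install datadog-ci (assumes Node.js/npm; standalone binaries are also available).
npm install -g @datadog/datadog-ci

# Evaluate the gate; a non-zero exit code indicates a failed evaluation.
datadog-ci deployment gate --service transaction-backend --env staging --version 1.0.1
```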
To call Deployment Gates from an Argo Rollouts Kubernetes resource, you can create an AnalysisTemplate or a ClusterAnalysisTemplate. The template should contain a Kubernetes job that performs the analysis.
Use this template as a starting point. For the `DD_SITE` environment variable, be sure to replace `<YOUR_DD_SITE>` with your Datadog site name (for example, `datadoghq.com`).
The command requires an API key and an application key. The safest way to provide them is by using Kubernetes Secrets. This example relies on a secret called `datadog` holding two data values: `api-key` and `app-key`. Alternatively, you can also pass the values in plain text using `value` instead of `valueFrom` in the template below.
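For example, you can create such a secret with `kubectl` (a sketch; adjust the namespace and key values to your setup):

```shell
# Create the "datadog" secret referenced by the template below.
kubectl create secret generic datadog \
  --from-literal=api-key='<YOUR_API_KEY>' \
  --from-literal=app-key='<YOUR_APP_KEY>'
```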
```yaml
apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: datadog-job-analysis
spec:
  args:
    - name: service
    - name: env
  metrics:
    - name: datadog-job
      provider:
        job:
          spec:
            ttlSecondsAfterFinished: 300
            backoffLimit: 0
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: datadog-check
                    image: datadog/ci:v3.17.0
                    env:
                      - name: DD_BETA_COMMANDS_ENABLED
                        value: "1"
                      - name: DD_SITE
                        value: "<YOUR_DD_SITE>"
                      - name: DD_API_KEY
                        valueFrom:
                          secretKeyRef:
                            name: datadog
                            key: api-key
                      - name: DD_APP_KEY
                        valueFrom:
                          secretKeyRef:
                            name: datadog
                            key: app-key
                    command: ["/bin/sh", "-c"]
                    args:
                      - datadog-ci deployment gate --service {{ args.service }} --env {{ args.env }}
```
- The analysis template can receive arguments from the Rollout resource. In this case, the arguments are `service` and `env`. Add any other optional fields if needed (such as `version`). For more information, see the official Argo Rollouts docs.
- The `ttlSecondsAfterFinished` field removes the finished jobs after 5 minutes.
- The `backoffLimit` field is set to 0 because the job might fail if the gate evaluation fails, and it should not be retried in that case.
After you have created the analysis template, reference it from the Argo Rollouts strategy:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rollouts-demo
  labels:
    tags.datadoghq.com/service: transaction-backend
    tags.datadoghq.com/env: dev
spec:
  replicas: 5
  strategy:
    canary:
      steps:
        ...
        - analysis:
            templates:
              - templateName: datadog-job-analysis
                clusterScope: true # Only needed for cluster analysis
            args:
              - name: env
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['tags.datadoghq.com/env']
              - name: service
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['tags.datadoghq.com/service']
              - name: version # Only required if one or more APM Faulty Deployment Detection rules are evaluated
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['tags.datadoghq.com/version']
        - ...
```
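If the Argo Rollouts kubectl plugin is installed, you can watch the rollout and its analysis runs progress (an illustrative command based on the Rollout above):

```shell
# Observe the canary steps and the datadog-job-analysis runs in real time.
kubectl argo rollouts get rollout rollouts-demo --watch
```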
Use this script as a starting point. For the `API_URL` variable, be sure to replace `<YOUR_DD_SITE>` with your Datadog site name (for example, `datadoghq.com`).
```shell
#!/bin/sh
# Configuration
MAX_RETRIES=3
DELAY_SECONDS=5
POLL_INTERVAL_SECONDS=15
MAX_POLL_TIME_SECONDS=10800 # 3 hours
API_URL="https://api.<YOUR_DD_SITE>/api/unstable/deployments/gates/evaluation"
API_KEY="<YOUR_API_KEY>"
APP_KEY="<YOUR_APP_KEY>"
PAYLOAD=$(cat <<EOF
{
"data": {
"type": "deployment_gates_evaluation_request",
"attributes": {
"service": "$1",
"env": "$2",
"version": "$3"
}
}
}
EOF
)
# Step 1: Request evaluation
echo "Requesting evaluation..."
current_attempt=0
while [ $current_attempt -lt $MAX_RETRIES ]; do
current_attempt=$((current_attempt + 1))
RESPONSE=$(curl -s -w "%{http_code}" -o response.txt -X POST "$API_URL" \
-H "Content-Type: application/json" \
-H "DD-API-KEY: $API_KEY" \
-H "DD-APPLICATION-KEY: $APP_KEY" \
-d "$PAYLOAD")
# Extracts the last 3 digits of the status code
HTTP_CODE=$(echo "$RESPONSE" | tail -c 4)
RESPONSE_BODY=$(cat response.txt)
if [ ${HTTP_CODE} -ge 500 ] && [ ${HTTP_CODE} -le 599 ]; then
# Status code 5xx indicates a server error, so the call is retried
echo "Attempt $current_attempt: 5xx Error ($HTTP_CODE). Retrying in $DELAY_SECONDS seconds..."
sleep $DELAY_SECONDS
continue
elif [ ${HTTP_CODE} -ge 400 ] && [ ${HTTP_CODE} -le 499 ]; then
# 4xx errors are client errors and not retriable
echo "Client error ($HTTP_CODE): $RESPONSE_BODY"
exit 1
fi
# Successfully started evaluation, extract evaluation_id
EVALUATION_ID=$(echo "$RESPONSE_BODY" | jq -r '.data.attributes.evaluation_id')
if [ "$EVALUATION_ID" = "null" ] || [ -z "$EVALUATION_ID" ]; then
echo "Failed to extract evaluation_id from response: $RESPONSE_BODY"
exit 1
fi
echo "Evaluation started with ID: $EVALUATION_ID"
break
done
# If no evaluation ID was obtained, every attempt failed with a 5xx error
if [ -z "$EVALUATION_ID" ]; then
  echo "All retries exhausted for evaluation request, but treating 5xx errors as success."
  exit 0
fi
# Step 2: Poll for results
echo "Polling for results..."
start_time=$(date +%s)
poll_count=0
while true; do
poll_count=$((poll_count + 1))
current_time=$(date +%s)
elapsed_time=$((current_time - start_time))
# Check if we've exceeded the maximum polling time
if [ $elapsed_time -ge $MAX_POLL_TIME_SECONDS ]; then
echo "Evaluation polling timeout after ${MAX_POLL_TIME_SECONDS} seconds"
exit 1
fi
RESPONSE=$(curl -s -w "%{http_code}" -o response.txt -X GET "$API_URL/$EVALUATION_ID" \
-H "DD-API-KEY: $API_KEY" \
-H "DD-APPLICATION-KEY: $APP_KEY")
HTTP_CODE=$(echo "$RESPONSE" | tail -c 4)
RESPONSE_BODY=$(cat response.txt)
if [ ${HTTP_CODE} -eq 404 ]; then
# Evaluation might not have started yet, retry after a short delay
echo "Evaluation not ready yet (404), retrying in $POLL_INTERVAL_SECONDS seconds... (attempt $poll_count, elapsed: ${elapsed_time}s)"
sleep $POLL_INTERVAL_SECONDS
continue
elif [ ${HTTP_CODE} -ge 500 ] && [ ${HTTP_CODE} -le 599 ]; then
echo "Server error ($HTTP_CODE) while polling, retrying in $POLL_INTERVAL_SECONDS seconds... (attempt $poll_count, elapsed: ${elapsed_time}s)"
sleep $POLL_INTERVAL_SECONDS
continue
elif [ ${HTTP_CODE} -ge 400 ] && [ ${HTTP_CODE} -le 499 ]; then
# 4xx errors (except 404) are client errors and not retriable
echo "Client error ($HTTP_CODE) while polling: $RESPONSE_BODY"
exit 1
fi
# Check gate status
GATE_STATUS=$(echo "$RESPONSE_BODY" | jq -r '.data.attributes.gate_status')
if [ "$GATE_STATUS" = "pass" ]; then
echo "Gate evaluation PASSED"
exit 0
elif [ "$GATE_STATUS" = "fail" ]; then
echo "Gate evaluation FAILED"
exit 1
else
# Treat any other status (in_progress, unexpected, etc.) as still in progress
echo "Evaluation still in progress (status: $GATE_STATUS), retrying in $POLL_INTERVAL_SECONDS seconds... (attempt $poll_count, elapsed: ${elapsed_time}s)"
sleep $POLL_INTERVAL_SECONDS
continue
fi
done
```
The script has the following characteristics:
- It receives three inputs: `service`, `environment`, and `version` (optionally add `identifier` and `primary_tag` if needed). The `version` is only required if one or more APM Faulty Deployment Detection rules are evaluated.
- It sends a request to start the evaluation and records the evaluation_id, handling the various HTTP response codes appropriately:
  - 5xx: Server errors, retried with a delay.
  - 4xx: Client errors, the evaluation fails.
  - 2xx: Evaluation started successfully.
- It polls the evaluation status endpoint using the evaluation_id until the evaluation is complete, handling the various HTTP response codes appropriately:
  - 5xx: Server errors, retried with a delay.
  - 404: Gate evaluation not started yet, retried with a delay.
  - 4xx (except 404): Client errors, the evaluation fails.
  - 2xx: Successful response; the gate status is checked and polling continues with a delay if the evaluation is not complete yet.
- The script polls every 15 seconds until the evaluation completes or the maximum polling time (10800 seconds = 3 hours by default) is reached.
- If all the retries are exhausted for the initial request (5xx responses), the script treats this as a success to be resilient to API failures. This is a general behavior, and you should change it based on your own use case and preferences.

The script uses `curl` (to perform the requests) and `jq` (to process the returned JSON). If those commands are not available, install them at the beginning of the script (for example, by adding `apk add --no-cache curl jq`).
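For example, assuming the script is saved as `deployment-gate.sh` (a hypothetical file name), you could invoke it like this:

```shell
chmod +x deployment-gate.sh
# Arguments: service, env, and (optionally) version.
./deployment-gate.sh transaction-backend staging 1.0.1
echo "Gate exit code: $?"   # 0 = pass, 1 = fail or error
```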
Deployment Gate evaluations are asynchronous, as the evaluation process can take some time to complete. When you trigger an evaluation, it’s started in the background, and the API returns an evaluation ID that can be used to track its progress. The high-level interaction with the Deployment Gates API is the following:
- First, request a Deployment Gate evaluation, which initiates the process and returns an evaluation ID.
- Then, periodically poll the evaluation status endpoint using the evaluation ID to retrieve the result when the evaluation is complete. Polling every 10-20 seconds is recommended.
A Deployment Gate evaluation can be requested with an API call:
curl -X POST "https://api.<span class="js-region-param region-param" data-region-param="dd_site"></span>/api/unstable/deployments/gates/evaluation" \
-H "Content-Type: application/json" \
-H "DD-API-KEY: <YOUR_API_KEY>" \
-H "DD-APPLICATION-KEY: <YOUR_APP_KEY>" \
-d @- << EOF
{
"data": {
"type": "deployment_gates_evaluation_request",
"attributes": {
"service": "transaction-backend",
"env": "staging",
"identifier": "my-custom-identifier", # Optional, defaults to "default"
"version": "v123-456", # Required for APM Faulty Deployment Detection rules
"primary_tag": "region:us-central-1" # Optional, scopes down APM Faulty Deployment Detection rules analysis to the selected primary tag
}
}
}'
Note: A 404 HTTP response can mean either that the gate was not found, or that the gate was found but has no rules.
If the gate evaluation was successfully started, a 202 HTTP status code is returned. The response is in the following format:
```json
{
  "data": {
    "id": "<random_response_uuid>",
    "type": "deployment_gates_evaluation_response",
    "attributes": {
      "evaluation_id": "e9d2f04f-4f4b-494b-86e5-52f03e10c8e9"
    }
  }
}
```
The field `data.attributes.evaluation_id` contains the unique identifier for this gate evaluation.
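For example, you can capture the evaluation ID directly with `jq` (a sketch using placeholder credentials and only the minimal required fields):

```shell
# Request an evaluation and extract the evaluation_id from the response.
EVALUATION_ID=$(curl -s -X POST "https://api.<YOUR_DD_SITE>/api/unstable/deployments/gates/evaluation" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: <YOUR_API_KEY>" \
  -H "DD-APPLICATION-KEY: <YOUR_APP_KEY>" \
  -d '{"data":{"type":"deployment_gates_evaluation_request","attributes":{"service":"transaction-backend","env":"staging"}}}' \
  | jq -r '.data.attributes.evaluation_id')
echo "Evaluation ID: $EVALUATION_ID"
```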
You can fetch the status of a gate evaluation by polling an additional API endpoint using the gate evaluation ID:
curl -X GET "https://api.<span class="js-region-param region-param" data-region-param="dd_site"></span>/api/unstable/deployments/gates/evaluation/<evaluation_id>" \
-H "DD-API-KEY: <YOUR_API_KEY>" \
-H "DD-APPLICATION-KEY: <YOUR_APP_KEY>"
Note: If you call this endpoint too quickly after requesting the evaluation, a 404 HTTP response may be returned because the evaluation did not start yet. If this is the case, retry a few seconds later.
When a 200 HTTP response is returned, it has the following format:
```json
{
  "data": {
    "id": "<random_response_uuid>",
    "type": "deployment_gates_evaluation_result_response",
    "attributes": {
      "dry_run": false,
      "evaluation_id": "e9d2f04f-4f4b-494b-86e5-52f03e10c8e9",
      "evaluation_url": "https://app.<YOUR_DD_SITE>/ci/deployment-gates/evaluations?index=cdgates&query=level%3Agate+%40evaluation_id%3Ae9d2f04f-4f4b-494b-86e5-52f03e10c8e9",
      "gate_id": "e140302e-0cba-40d2-978c-6780647f8f1c",
      "gate_status": "pass",
      "rules": [
        {
          "name": "Check service monitors",
          "status": "fail",
          "reason": "One or more monitors in ALERT state: https://app.<YOUR_DD_SITE>/monitors/34330981",
          "dry_run": true
        }
      ]
    }
  }
}
```
The field `data.attributes.gate_status` contains the result of the evaluation. It can contain one of these values:
- `in_progress`: The Deployment Gate evaluation is still in progress; you should continue polling.
- `pass`: The Deployment Gate evaluation passed.
- `fail`: The Deployment Gate evaluation failed.
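For example, a polling step can branch on this field with `jq` (a sketch; `$EVALUATION_ID` comes from the initial request):

```shell
# Fetch the current gate status for the evaluation.
GATE_STATUS=$(curl -s "https://api.<YOUR_DD_SITE>/api/unstable/deployments/gates/evaluation/$EVALUATION_ID" \
  -H "DD-API-KEY: <YOUR_API_KEY>" \
  -H "DD-APPLICATION-KEY: <YOUR_APP_KEY>" \
  | jq -r '.data.attributes.gate_status')

case "$GATE_STATUS" in
  pass) echo "Gate passed" ;;
  fail) echo "Gate failed"; exit 1 ;;
  *)    echo "Still in progress; poll again in 10-20 seconds" ;;
esac
```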
Note: If the field `data.attributes.dry_run` is `true`, the field `data.attributes.gate_status` is always `pass`.
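To see how individual rules evaluated, including those run in dry run mode, you can inspect the `rules` array (an illustrative `jq` filter over a saved response body, here a hypothetical `response.json`):

```shell
# Print each rule's name, status, and dry_run flag.
jq -r '.data.attributes.rules[] | "\(.name): \(.status) (dry_run: \(.dry_run))"' response.json
```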