---
isPrivate: true
title: Datadog Disaster Recovery
description: Datadog, the leading service for cloud-scale monitoring.
---

# Datadog Disaster Recovery

{% callout %}
# Important note for users on the following Datadog sites: app.ddog-gov.com, us2.ddog-gov.com

{% alert level="danger" %}
This product is not supported for your selected [Datadog site](https://docs.datadoghq.com/getting_started/site.md) ({% placeholder "user-datadog-site-name" /%}).
{% /alert %}

{% /callout %}

## Overview{% #overview %}

Datadog Disaster Recovery (DDR) provides you with observability continuity during events that may impact a cloud service provider region or Datadog services running within a cloud provider region. Using DDR, you can recover live observability at an alternate, functional Datadog site, enabling you to meet your critical observability availability goals.

DDR also allows you to periodically conduct disaster recovery drills to not only test your ability to recover from outage events, but to also meet your business and regulatory compliance needs.

## Prerequisites{% #prerequisites %}

The minimum version of the Datadog Agent you need depends on the types of telemetry you need to use:

| Supported telemetry | Supported products        | Agent version required |
| ------------------- | ------------------------- | ---------------------- |
| Logs                | Logs                      | v7.54+                 |
| Metrics             | Infrastructure Monitoring | v7.54+                 |
| Traces              | APM                       | v7.68+                 |
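
The table above can be turned into a quick pre-check. The following is a minimal sketch, not an official tool: the helper names `required_version` and `meets_minimum` are hypothetical, and the sketch assumes you pass in the installed version string yourself (for example, copied from the output of `agent version`).

```shell
# Hypothetical helpers; the version thresholds come from the table above.
required_version() {
  # $1 = telemetry type: logs, metrics, or traces
  case "$1" in
    logs|metrics) echo "7.54" ;;
    traces)       echo "7.68" ;;
    *)            return 1 ;;
  esac
}

meets_minimum() {
  # $1 = installed version (for example 7.60.1 or v7.60.1)
  # $2 = required major.minor (for example 7.54)
  installed=${1#v}
  inst_major=${installed%%.*}
  rest=${installed#*.}
  inst_minor=${rest%%.*}
  req_major=${2%%.*}
  req_minor=${2#*.}
  [ "$inst_major" -gt "$req_major" ] ||
    { [ "$inst_major" -eq "$req_major" ] && [ "$inst_minor" -ge "$req_minor" ]; }
}
```

For example, `meets_minimum 7.60.1 "$(required_version traces)"` fails, which indicates that the Agent must be upgraded before traces can be failed over.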

{% alert level="info" %}
Datadog is continuously evaluating customer requests to support DDR for additional products. Contact the [Disaster Recovery team](mailto:disaster-recovery@datadoghq.com) to learn about upcoming capabilities or to discuss needs that are not covered above.
{% /alert %}
 
## Setup{% #setup %}

To enable Datadog Disaster Recovery, follow these steps. If you have any questions about any of the steps, contact your [Customer Success Manager](mailto:success@datadoghq.com) or [Datadog Support](https://www.datadoghq.com/support/).

### 1. Create a DDR org and link it to your primary org

{% collapsible-section %}
##### Create and share your DDR org

{% alert level="info" %}
If required, Datadog can set this up for you.
{% /alert %}

#### Create your DDR org{% #create-your-ddr-org %}

1. Go to [Get Started with Datadog](https://app.datadoghq.com/signup). You may need to log out of your current session, or use incognito mode to access this page.
1. Choose a different Datadog site than your primary (for example, if you're on `US1`, choose `EU` or `US5`).
1. Follow the prompts to create an account.

All Datadog sites are geographically separated. Reference the [Datadog Site List](https://docs.datadoghq.com/getting_started/site.md#access-the-datadog-site) for options.

If you are also sending telemetry to Datadog using cloud provider integrations, you must add your cloud provider accounts in the DDR org. Datadog does not use cloud providers to receive telemetry data while the DDR site is passive (not in failover).

#### Share the DDR org information with Datadog{% #share-the-ddr-org-information-with-datadog %}

Email your new org name to your [Customer Success Manager](mailto:success@datadoghq.com). Then, your Customer Success Manager sets this new org as your DDR org.
{% /collapsible-section %}

{% collapsible-section %}
##### Retrieve the public IDs and link your DDR and primary orgs

For security reasons, Datadog is unable to link the orgs on your behalf.

After the Datadog team has set your DDR org, use the Datadog [public API endpoint](https://docs.datadoghq.com/api/latest/organizations.md#list-your-managed-organizations) to retrieve the public IDs of the primary and DDR orgs.
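
As an illustration, the request to that endpoint might look like the sketch below. It assumes the endpoint path is `/api/v1/org`, that `PRIMARY_DD_API_URL` holds the API base URL of the org you are querying (for example `https://api.datadoghq.com`), and that you repeat the call with the DDR org's own site and keys to get its ID. The function names `org_list_url` and `list_orgs` are hypothetical.

```shell
org_list_url() {
  # $1 = API base URL; a trailing slash is removed before the path is appended
  printf '%s/api/v1/org' "${1%/}"
}

list_orgs() {
  # Prints the raw JSON response; look for the "public_id" field of each org.
  curl -s -X GET "$(org_list_url "${PRIMARY_DD_API_URL}")" \
    -H "Accept: application/json" \
    -H "DD-API-KEY: ${PRIMARY_DD_API_KEY}" \
    -H "DD-APPLICATION-KEY: ${PRIMARY_DD_APP_KEY}"
}
```

Nothing is sent until you call `list_orgs`, so you can define the functions first and check the URL with `org_list_url "$PRIMARY_DD_API_URL"`.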

To link your DDR org to your primary org:

- Add the `disaster_recovery_status_write` scope to your application key in the primary org.
- Run the following commands, replacing the placeholders with the appropriate values.

```shell
# Credentials and API base URL for the primary org
export PRIMARY_DD_API_KEY="<PRIMARY_ORG_API_KEY>"
export PRIMARY_DD_APP_KEY="<PRIMARY_ORG_APP_KEY>"
export PRIMARY_DD_API_URL="<PRIMARY_ORG_API_SITE>"

# Public IDs of both orgs and the email of the user making the change
export DDR_ORG_ID="<DDR_ORG_PUBLIC_ID>"
export PRIMARY_ORG_ID="<PRIMARY_ORG_PUBLIC_ID>"
export USER_EMAIL="<USER_EMAIL>"
export CONNECTION='{"data":{"id":"'"${PRIMARY_ORG_ID}"'","type":"hamr_org_connections","attributes":{"TargetOrgUuid":"'"${DDR_ORG_ID}"'","HamrStatus":1,"ModifiedBy":"'"${USER_EMAIL}"'","IsPrimary":true}}}'

# Link the DDR org to the primary org
curl -v --request POST "${PRIMARY_DD_API_URL}/api/v2/hamr" \
  -H "Content-Type: application/json" \
  -H "dd-api-key:${PRIMARY_DD_API_KEY}" \
  -H "dd-application-key:${PRIMARY_DD_APP_KEY}" \
  --data "${CONNECTION}"
```
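
Because the request above relies on six variables, it can help to confirm that none of them is empty or still set to a `<PLACEHOLDER>` before sending it. The following is an optional sketch; the function names `still_placeholder` and `check_link_vars` are hypothetical and not part of any Datadog tooling.

```shell
still_placeholder() {
  # Succeeds when the value is empty or still looks like <SOME_PLACEHOLDER>
  case "$1" in
    ''|'<'*'>') return 0 ;;
    *)          return 1 ;;
  esac
}

check_link_vars() {
  # Reports every variable that still needs a real value; fails if any do.
  missing=0
  for name in PRIMARY_DD_API_KEY PRIMARY_DD_APP_KEY PRIMARY_DD_API_URL \
              DDR_ORG_ID PRIMARY_ORG_ID USER_EMAIL; do
    eval "value=\${$name:-}"
    if still_placeholder "$value"; then
      echo "unset or placeholder: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Run `check_link_vars` after the `export` lines and send the `curl` request only if it succeeds.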

After linking your orgs, only the DDR (failover) org displays this banner:

{% image
   source="https://docs.dd-static.net/images/agent/guide/ddr/ddr-banner.e472f1a091274ed415e35dbe7d22062f.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/agent/guide/ddr/ddr-banner.e472f1a091274ed415e35dbe7d22062f.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="The DDR banner in the DDR org" /%}

{% /collapsible-section %}

### 2. Set up access, integrations, syncing, and Agents

{% collapsible-section %}
##### Configure Single Sign On for the DDR org

**Datadog recommends using Single Sign On (SSO)** to enable all your users to log in to your Disaster Recovery org during an outage.

Go to the [Organization Settings](https://app.datadoghq.com/organization-settings/users) in your DDR org to configure [SAML](https://docs.datadoghq.com/account_management/saml.md#overview) or Google Login for your users.

Managed sync replicates user accounts from your primary org to your DDR org. Datadog recommends configuring [Just-in-Time provisioning with SAML](https://docs.datadoghq.com/account_management/saml.md#just-in-time-jit-provisioning) so users can access the DDR org during a failover without needing to reset their password.
{% /collapsible-section %}

{% collapsible-section %}
##### Set up your cloud integrations (AWS, Azure, Google Cloud)

See the [AWS](https://docs.datadoghq.com/integrations/amazon-web-services.md), [Azure](https://docs.datadoghq.com/integrations/azure.md), and [Google Cloud](https://docs.datadoghq.com/integrations/google-cloud-platform.md?tab=organdfolderlevelprojectdiscovery#overview) integrations for setup steps.

Your cloud integrations must be configured in both primary and DDR orgs, but they run in only one org at a time: by default in the primary org, and in the DDR org during failover.

For more information, see [Activate and test DDR failover in cloud integrations](#id-for-cloud).
{% /collapsible-section %}

{% collapsible-section #syncing-data %}
##### Set up credentials for managed resource sync

Datadog manages resource sync on your behalf using the open source [datadog-sync-cli](https://github.com/DataDog/datadog-sync-cli) tool. You do not need to run or operate this tool yourself.

Managed sync replicates resources from your primary org to your DDR org on a regular schedule, so your DDR org is current before an outage occurs. Replicated resources include dashboards, monitors, users, notebooks, and [34+ other resource types](https://github.com/DataDog/datadog-sync-cli#supported-resources).

**Users are scoped to each Datadog site.** Managed sync replicates user accounts to your DDR org. However, users may need to reset their password on first login to the DDR org. Datadog recommends configuring [Just-in-Time provisioning with SAML](https://docs.datadoghq.com/account_management/saml.md#just-in-time-jit-provisioning) so users can access the DDR org without manual password resets.

**Managed sync uses a Datadog [service account](https://docs.datadoghq.com/account_management/org_settings/service_accounts.md).** During onboarding, create a service account in your DDR org to read and replicate resources from your primary org. Resources synced by managed sync are provisioned by a user mapped to their original owner when possible.
{% /collapsible-section %}

{% collapsible-section %}
##### Enable Remote Configuration (recommended)

[Remote Configuration (RC)](https://docs.datadoghq.com/agent/remote_config.md?tab=configurationyamlfile) allows you to remotely configure and change the behavior of Datadog Agents deployed in your infrastructure.

Remote Configuration is enabled by default for new orgs, including your DDR org. Any new API keys you create are RC-enabled for use with your Agent. For more details, see the [Remote Configuration documentation](https://docs.datadoghq.com/agent/remote_config.md?tab=configurationyamlfile).

Datadog strongly recommends using Remote Configuration for better failover control. As an alternative to RC, you can manually configure your Agents or use configuration management tools such as Puppet, Ansible, or Chef.
{% /collapsible-section %}

{% collapsible-section %}
##### Dual ship telemetry to DDR org during failover or drills

To enable Dual Shipping, Datadog recommends using [Fleet Automation](https://docs.datadoghq.com/agent/fleet_automation.md#overview) for management at scale. Alternatively, you can configure it manually by editing your `datadog.yaml` file.

Contact your Datadog Customer Success Manager to schedule dedicated time windows for failover testing to measure performance and Recovery Time Objective (RTO).

{% tab title="Using Fleet Automation (recommended)" %}
From the [Fleet Automation](https://app.datadoghq.com/fleet) page in your failover org, on the Configure Agents tab, you can create a failover policy or reuse an existing one, and apply it to your fleet of Agents. Soon after the policy is enabled, Agents begin dual-shipping telemetry to both the primary and DDR (failover) observability sites.

To create a failover policy, click Create Failover Policy.

{% image
   source="https://docs.dd-static.net/images/agent/guide/ddr/ddr-fa-policy.4c932994e6282dbf021e10a67ab4f910.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/agent/guide/ddr/ddr-fa-policy.4c932994e6282dbf021e10a67ab4f910.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="Manage DDR policies" /%}

Then, follow the prompt to scope the hosts and telemetry (metrics, logs, traces) that you need to fail over.

{% image
   source="https://docs.dd-static.net/images/agent/guide/ddr/ddr-fa-policy-scope.6219fdab9760a8ee0c9d918311b94916.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/agent/guide/ddr/ddr-fa-policy-scope.6219fdab9760a8ee0c9d918311b94916.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="Scope the hosts and telemetry required to failover" /%}

{% alert level="danger" %}
Cloud integrations can run in either your primary or your DDR Datadog site, but not in both at the same time. Failing them over stops cloud integration data collection in your primary site. **During an integration failover, integrations run only in the DDR data center.** When no longer in failover, disable the failover policy to return integration data collection to the primary org.
{% /alert %}

{% /tab %}

{% tab title="Manually" %}
During a failover or a failover exercise, update your Datadog Agent's `datadog.yaml` configuration file as shown in the example below, then restart the Agent.

- `enabled: true` allows the Agent to send metadata (data about the Agent and the infrastructure host, such as host name, host tags, and Agent version) to the DDR Datadog site, so you can view your Agents and infrastructure hosts in the DDR org.

- `failover_metrics`, `failover_logs`, and `failover_apm` are `false` by default. Setting these to `true` causes the Agent to start sending telemetry (data sent to the Datadog platform, such as logs, metrics, and traces) to the DDR org.

```yaml
multi_region_failover:
  enabled: true
  failover_metrics: false
  failover_logs: false
  failover_apm: false
  site: <DDR_SITE>  # For example "site: us5.datadoghq.com" for a US5 site
  api_key: <DDR_SITE_API_KEY>
```
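
Before restarting the Agent, you may want to read the values back from the file to confirm what it will pick up. The sketch below is one way to do that with `awk`. The function name `mrf_setting` is hypothetical, and the parser is deliberately simple: it assumes the `multi_region_failover` block uses the plain, indented `key: value` layout shown above.

```shell
mrf_setting() {
  # $1 = path to datadog.yaml, $2 = key under multi_region_failover
  awk -v key="$2" '
    /^multi_region_failover:/ { in_block = 1; next }
    in_block && /^[^ \t#]/    { in_block = 0 }  # next top-level key ends the block
    in_block {
      line = $0
      sub(/#.*/, "", line)                      # drop trailing comments
      if (match(line, "^[ \t]+" key ":[ \t]*")) {
        value = substr(line, RSTART + RLENGTH)
        gsub(/[ \t]+$/, "", value)              # trim trailing whitespace
        print value
        exit
      }
    }
  ' "$1"
}
```

For example, `mrf_setting /etc/datadog-agent/datadog.yaml failover_logs` prints the configured value, or nothing if the key is absent. The path is the usual Linux location and may differ on your platform.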

{% /tab %}

{% /collapsible-section %}

{% collapsible-section %}
##### Configure DNS-based failover

DNS-based failover is a complementary approach to Agent-based failover. Instead of configuring Agents with a secondary site endpoint, you configure all your data sources to send telemetry to a single Datadog-provided custom intake URL. During a failover event, Datadog updates the DNS record for that URL to redirect traffic from your primary site to your DDR site.

{% alert level="info" %}
DNS failover is all-or-nothing. All telemetry sources using your custom endpoint cut over simultaneously.
{% /alert %}

#### Receive your custom DNS endpoint{% #receive-your-custom-dns-endpoint %}

If you choose to use DNS-based failover, Datadog provisions a custom intake URL for your organization (for example, `<your-org>.intake.datadoghq.com`). Configure all your data sources (Agents, log shippers, and custom instrumentation) to send telemetry to this endpoint instead of the default Datadog intake URL. This is a one-time configuration change.

#### Trigger a DNS failover{% #trigger-a-dns-failover %}

To initiate a DNS failover, contact Datadog through your [Customer Success Manager](mailto:success@datadoghq.com) or [Datadog Support](https://www.datadoghq.com/support/). Datadog updates the DNS record to redirect traffic from your primary site to your DDR site. The target Recovery Time Objective (RTO) from the time failover is initiated is 2 hours.
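
To observe the cutover from your side, you can compare what your custom endpoint resolves to before and after the failover is initiated. This is a rough sketch under stated assumptions: the hostname is a stand-in for the URL Datadog provisions for you, the function names are hypothetical, and the record type Datadog uses is not documented here, so the sketch compares whatever `dig +short` returns first.

```shell
# Stand-in hostname; replace with your provisioned custom intake URL.
INTAKE_HOST="${INTAKE_HOST:-your-org.intake.datadoghq.com}"

current_target() {
  # Prints the first record dig returns for the host (CNAME target or address).
  dig +short "$1" | head -n 1
}

target_changed() {
  # $1 = target recorded before failover, $2 = target observed now
  [ -n "$2" ] && [ "$1" != "$2" ]
}
```

Record `before="$(current_target "$INTAKE_HOST")"` ahead of the drill, then check `target_changed "$before" "$(current_target "$INTAKE_HOST")"` periodically. Local resolver caching can delay when the change becomes visible.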

{% alert level="info" %}
A customer-controlled way to trigger DNS failover directly from the DDR org is in Preview. Contact your [Customer Success Manager](mailto:success@datadoghq.com) to learn more.
{% /alert %}

{% /collapsible-section %}

### 3. Run failover tests in various environments

{% collapsible-section %}
##### Activate and test DDR failover in Agent-based environments

To trigger a failover of your Agents, click one of the policies in [Fleet Automation](https://app.datadoghq.com/fleet) in your DDR org, and then click Enable. The status of each host updates as the failover occurs.

{% image
   source="https://docs.dd-static.net/images/agent/guide/ddr/ddr-fa-policy-enable3.675dbd5b8c33b201fba461e3ff252adb.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/agent/guide/ddr/ddr-fa-policy-enable3.675dbd5b8c33b201fba461e3ff252adb.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="Enable the failover policy in the DDR org" /%}

Use the steps appropriate for your environment to activate or test the DDR failover.

{% tab title="Agent in non-containerized environments" %}
For Agent deployments in non-containerized environments, use the following Agent CLI commands:

```shell
agent config set multi_region_failover.failover_metrics true
agent config set multi_region_failover.failover_logs true
agent config set multi_region_failover.failover_apm true
```
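
If you run drills regularly, the three commands can be generated from a single helper so that the same code path both activates and reverts the failover. This is a sketch, with `mrf_commands` as a hypothetical name. It assumes that setting the flags back to `false` (their default) ends the failover. The helper only prints the commands, so you can review them before running them.

```shell
mrf_commands() {
  # $1 = true to fail over, false to revert
  # Prints one `agent config set` command per telemetry type.
  for signal in metrics logs apm; do
    printf 'agent config set multi_region_failover.failover_%s %s\n' "$signal" "$1"
  done
}
```

Run `mrf_commands true` to preview the commands, and `mrf_commands true | sh` to execute them on the host.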

{% /tab %}

{% tab title="Agent in containerized environments" %}
If you are running the Agent in a containerized environment like Kubernetes, you can still use the Agent command-line tool, but you need to invoke it on the container running the Agent. You can make changes using one of the following, depending on your needs:

- kubectl
- Agent configuration file (`datadog.yaml`)
- Helm chart or Datadog Operator

##### Using kubectl{% #using-kubectl %}

Below is an example of using `kubectl` to fail over metrics, logs, and traces for a Datadog Agent pod deployed with either the official Helm chart or Datadog Operator. Replace `<POD_NAME>` with the name of the Agent pod:

```shell
kubectl exec <POD_NAME> -c agent -- agent config set multi_region_failover.failover_metrics true
kubectl exec <POD_NAME> -c agent -- agent config set multi_region_failover.failover_logs true
kubectl exec <POD_NAME> -c agent -- agent config set multi_region_failover.failover_apm true
```
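
In a cluster with many Agent pods, the same commands can be applied in a loop. The sketch below is an illustration only: the label selector is a hypothetical default that you should replace with the labels your Agent pods actually carry, the function name is made up, and the commands run in the namespace of your current `kubectl` context.

```shell
KUBECTL="${KUBECTL:-kubectl}"
# Hypothetical selector; adjust to match your Agent pods.
AGENT_SELECTOR="${AGENT_SELECTOR:-app.kubernetes.io/component=agent}"

set_failover_on_all_pods() {
  # $1 = true to fail over, false to revert
  $KUBECTL get pods -l "$AGENT_SELECTOR" -o name | while read -r pod; do
    for signal in metrics logs apm; do
      # stdin is redirected so exec cannot consume the pod list
      $KUBECTL exec "$pod" -c agent -- \
        agent config set "multi_region_failover.failover_${signal}" "$1" </dev/null
    done
  done
}
```

Call `set_failover_on_all_pods true` to fail over every matching pod. Changes made this way are runtime changes on running pods, so pods created later still start from the configuration in your Helm values, Operator spec, or `datadog.yaml`.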

##### Using the Agent configuration file{% #using-the-agent-configuration-file %}

Alternatively, you can specify the following settings in the main Agent configuration file (`datadog.yaml`) and restart the Datadog Agent for the changes to apply:

```yaml
multi_region_failover:
  enabled: true
  failover_metrics: true
  failover_logs: true
  failover_apm: true
  site: <DDR_SITE>
  api_key: <DDR_SITE_API_KEY>
```

##### Using the Helm chart or Datadog Operator{% #using-the-helm-chart-or-datadog-operator %}

You can make similar changes with either the official Helm chart or Datadog Operator if you need to specify a custom configuration. Otherwise, you can pass the settings as environment variables:

```shell
DD_MULTI_REGION_FAILOVER_ENABLED=true
DD_MULTI_REGION_FAILOVER_FAILOVER_METRICS=true
DD_MULTI_REGION_FAILOVER_FAILOVER_LOGS=true
DD_MULTI_REGION_FAILOVER_FAILOVER_APM=true
DD_MULTI_REGION_FAILOVER_SITE=<DDR_SITE>
DD_MULTI_REGION_FAILOVER_API_KEY=<DDR_SITE_API_KEY>
```

{% /tab %}

{% /collapsible-section %}

{% collapsible-section #id-for-cloud %}
##### Activate and test DDR failover in cloud integrations

You can test failover for your cloud integrations from your DDR organization's landing page.

{% image
   source="https://docs.dd-static.net/images/agent/guide/ddr/ddr-failover-main-page.62441b2f7a7da4c39a5c8665a69663f7.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/agent/guide/ddr/ddr-failover-main-page.62441b2f7a7da4c39a5c8665a69663f7.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="The failover landing page in the DDR org" /%}

On the failover landing page, you can check the status of your DDR org, or click Fail over your integrations to test your cloud integration failover.

When no longer in failover, **disable the failover policy** in the DDR org to return integration data collection to the primary org.

During testing, integration telemetry is spread over both organizations. If you cancel a failover test, the integrations return to running in the primary data center.
{% /collapsible-section %}

## Further reading{% #further-reading %}

- [Remote Configuration](https://docs.datadoghq.com/agent/remote_config.md?tab=configurationyamlfile)
- [Getting Started with Datadog Sites](https://docs.datadoghq.com/getting_started/site.md)
- [Datadog Disaster Recovery mitigates cloud provider outages](https://www.datadoghq.com/blog/ddr-mitigates-cloud-provider-outages/)
