---
isPrivate: true
title: (LEGACY) High Availability and Disaster Recovery
description: Datadog, the leading service for cloud-scale monitoring.
breadcrumbs: >-
  Docs > Observability Pipelines > (LEGACY) Observability Pipelines
  Documentation > (LEGACY) Best Practices for OPW Aggregator Architecture >
  (LEGACY) High Availability and Disaster Recovery
---

# (LEGACY) High Availability and Disaster Recovery

{% callout %}
# Important note for users on the following Datadog sites: app.ddog-gov.com

{% alert level="danger" %}
This product is not supported for your selected [Datadog site](https://docs.datadoghq.com/getting_started/site).
{% /alert %}

{% /callout %}

{% alert level="info" %}
This guide is for large-scale production-level deployments.
{% /alert %}

In the context of Observability Pipelines, high availability means that the Observability Pipelines Worker remains available even when individual parts of the system fail.

{% image
   source="https://datadog-docs.imgix.net/images/observability_pipelines/production_deployment_overview/high_availability.f13f0530e5f6effbdbf25d9084792b69.png?auto=format"
   alt="A diagram showing availability zone one with load balancer one offline, and both agents sending data to load balancer two and then to Worker one and Worker two. In availability zone two, Worker three is down, so both load balancers are sending data to Worker N" /%}

To achieve high availability:

1. Deploy at least two Observability Pipelines Worker instances in each Availability Zone.
1. Deploy Observability Pipelines Worker in at least two Availability Zones.
1. Front your Observability Pipelines Worker instances with a load balancer that balances traffic across Observability Pipelines Worker instances. See [Capacity Planning and Scaling](https://docs.datadoghq.com/observability_pipelines/legacy/architecture/capacity_planning_scaling) for more information.
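On Kubernetes, the three steps above can be sketched as a Deployment fronted by a load-balanced Service. This is an illustrative sketch, not an official manifest: the image tag, port numbers, and `app: opw` labels are assumptions.

```yaml
# Sketch only: image, ports, and labels are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: opw
spec:
  replicas: 4                # at least two Workers per zone, across two zones
  selector:
    matchLabels:
      app: opw
  template:
    metadata:
      labels:
        app: opw
    spec:
      containers:
        - name: worker
          image: datadog/observability-pipelines-worker:latest  # assumed image name
          ports:
            - containerPort: 8282                               # assumed data port
---
apiVersion: v1
kind: Service
metadata:
  name: opw-lb
spec:
  type: LoadBalancer         # network load balancer in front of the Workers
  selector:
    app: opw
  ports:
    - port: 8282
      targetPort: 8282
```

A `Service` of type `LoadBalancer` provisions the fronting load balancer on most managed Kubernetes platforms; on other platforms, substitute your own network load balancer pointed at the Worker instances.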

## Mitigating failure scenarios{% #mitigating-failure-scenarios %}

### Handling Observability Pipelines Worker process issues{% #handling-observability-pipelines-worker-process-issues %}

To mitigate a system process issue, distribute the Observability Pipelines Worker across multiple nodes and front them with a network load balancer that can redirect traffic to another Observability Pipelines Worker instance as needed. In addition, platform-level automated self-healing should eventually restart the process or replace the node.
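On Kubernetes, the platform-level self-healing described above can be approximated with a liveness probe, so the kubelet restarts a wedged Worker process automatically. The health endpoint path and port below are assumptions; check your Worker's API configuration for the actual values.

```yaml
# Sketch: probe path and port are assumptions, not guaranteed Worker defaults.
livenessProbe:
  httpGet:
    path: /health            # assumed health endpoint exposed by the Worker API
    port: 8686               # assumed API port
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3        # restart after three consecutive failed checks
```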

{% image
   source="https://datadog-docs.imgix.net/images/observability_pipelines/production_deployment_overview/process_failure.48b6c8c4778763038d0835a71f01142e.png?auto=format"
   alt="A diagram showing three nodes, where each node has an Observability Pipelines Worker" /%}

### Mitigating node failures{% #mitigating-node-failures %}

To mitigate node issues, distribute the Observability Pipelines Worker across multiple nodes and front them with a network load balancer that can redirect traffic to another Observability Pipelines Worker node. In addition, platform-level automated self-healing should eventually replace the node.
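One way to enforce this multi-node distribution on Kubernetes is a topology spread constraint keyed on hostname, so the scheduler never stacks all Workers on a single node. This is a sketch; the `app: opw` label is a hypothetical example carried over from a Worker Deployment.

```yaml
# Sketch: labels are hypothetical examples.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname   # spread Workers across nodes
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: opw                          # hypothetical Worker pod label
```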

{% image
   source="https://datadog-docs.imgix.net/images/observability_pipelines/production_deployment_overview/node_failure.a3b410e97812e1c98e7bbb7e282e141d.png?auto=format"
   alt="A diagram showing data going to node one's load balancer, but because the Observability Pipelines Worker is down in node one, the data is sent to the Workers in node two and node N" /%}

### Handling availability zone failures{% #handling-availability-zone-failures %}

To mitigate issues with availability zones, deploy the Observability Pipelines Worker across multiple availability zones.
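The same pattern extends to zones: on Kubernetes, a spread constraint on the zone topology label keeps Worker replicas balanced across availability zones. Again a sketch, with a hypothetical `app: opw` label.

```yaml
# Sketch: labels are hypothetical examples.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # spread Workers across zones
    whenUnsatisfiable: ScheduleAnyway          # prefer spreading; still schedule if a zone is lost
    labelSelector:
      matchLabels:
        app: opw                               # hypothetical Worker pod label
```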

{% image
   source="https://datadog-docs.imgix.net/images/observability_pipelines/production_deployment_overview/availability_zone_failure.ea3e52a7e919262da659cd9a2e44c3f2.png?auto=format"
   alt="A diagram showing the load balancers and Observability Pipelines Worker down in availability zone one, but load balancers and Workers in zone N still receiving and sending data" /%}

### Mitigating region failures{% #mitigating-region-failures %}

Observability Pipelines Worker is designed to route internal observability data, and it should not fail over to another region. Instead, deploy the Observability Pipelines Worker in every region you operate in; if an entire network or region fails, the Observability Pipelines Worker fails with it. See [Networking](https://docs.datadoghq.com/observability_pipelines/legacy/architecture/networking) for more information.

## Disaster recovery{% #disaster-recovery %}

### Internal disaster recovery{% #internal-disaster-recovery %}

Observability Pipelines Worker is an infrastructure-level tool designed to route internal observability data. It implements a shared-nothing architecture and does not manage state that should be replicated or transferred to a disaster recovery (DR) site. If your entire region fails, the Observability Pipelines Worker fails with it, so install the Observability Pipelines Worker in your DR site as part of your broader DR plan.

### External disaster recovery{% #external-disaster-recovery %}

If you're using a managed destination, such as Datadog, the Observability Pipelines Worker can automatically route data to your Datadog DR site using its circuit breaker feature.

{% image
   source="https://datadog-docs.imgix.net/images/observability_pipelines/production_deployment_overview/external_disaster_recovery.d20f881476394b1e73c3a97e3b3fc240.png?auto=format"
   alt="A diagram showing Observability Pipelines Workers in different zones, and all sending data to the same disaster recovery destination" /%}
