High durability is the ability to retain data when there are system failures. The aggregator architecture is designed to take on the responsibility of high durability. This simplifies your durability strategy by shifting the burden away from your agents and localizing it to your aggregators. In addition, this concentrated approach allows for durability strategies that would be difficult to implement across all of your agent nodes.
To achieve high durability:
Configure your agents to be simple data forwarders that stream data directly to your Observability Pipelines Worker aggregator. This minimizes the time data spends at the edge, where it is not yet redundant and is exposed to loss.
Choose a highly durable destination that serves as your system of record (for example, Amazon S3). This system is responsible for the durability of data at rest and is commonly referred to as archives or data lakes.
Finally, configure the Observability Pipelines Worker sinks that write to your system of record to enable end-to-end acknowledgments and disk buffers. For example:
sinks:
  aws_s3:
    acknowledgments: true
    buffer:
      type: "disk"
If the Observability Pipelines Worker operating system process fails, data held in memory at that moment can be lost. Enable Observability Pipelines Worker's end-to-end acknowledgment feature to mitigate this risk:
sinks:
  aws_s3:
    acknowledgments: true
With this feature enabled, Observability Pipelines Worker does not respond to agents until the data has been durably persisted. This prevents the agent from releasing the data prematurely, and the agent sends the data again if it does not receive an acknowledgment.
Node failures deal with the full failure of an individual node. These can also be addressed using end-to-end acknowledgments. See Using end-to-end acknowledgment for more details.
Disk failures deal with the failure of an individual disk. Data loss related to disk failures can be mitigated by using a highly durable file system where data is replicated across multiple disks, such as block storage (for example, Amazon EBS).
The Observability Pipelines Worker can encounter errors, such as a failure to parse a log, when it processes malformed data. There are two ways to mitigate this issue:
Direct archiving: Route data directly from your sources to your archive. This ensures that data makes it to your archive without risk of being dropped. In addition, this data can be replayed after correcting the processing error.
Failed event routing: The Observability Pipelines Worker offers failed event routing for users who wish to archive processed data, such as structured and enriched data. Certain Observability Pipelines Worker transforms come with a dropped output that can be connected to a sink for durability and replay.
If durability is the most important criterion, use the direct archiving method because it addresses data loss scenarios. Use the failed event routing method if you prefer to analyze data in your archive, an approach commonly referred to as a data lake. Its advantage is that your archive also serves long-term analysis. Datadog Log Archives and Amazon Athena are examples of archive storage solutions.
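The two approaches can be sketched in a single Worker configuration. This is an illustrative fragment rather than a complete pipeline. The component names (`agents`, `parse_logs`, `raw_archive`, `processed_archive`, `failed_archive`), the listen address, the bucket names, and the parsing logic are all hypothetical. The `drop_on_error` and `reroute_dropped` options and the `.dropped` output name assume a Vector-style `remap` transform, so check them against the reference for your Worker version. Required sink settings (region, encoding) and the durability settings shown earlier are omitted for brevity.

```yaml
sources:
  agents:                        # hypothetical source receiving agent traffic
    type: datadog_agent
    address: "0.0.0.0:8282"      # hypothetical listen address

transforms:
  parse_logs:                    # hypothetical transform that can fail on malformed data
    type: remap
    inputs: ["agents"]
    drop_on_error: true          # drop events whose processing raises an error
    reroute_dropped: true        # emit those events on the `dropped` output
    source: |
      .parsed = parse_json!(string!(.message))

sinks:
  # Option 1, direct archiving: raw data goes to the archive without processing.
  raw_archive:
    type: aws_s3
    inputs: ["agents"]
    bucket: "example-raw-archive"         # hypothetical bucket

  # Option 2, failed event routing: archive processed data...
  processed_archive:
    type: aws_s3
    inputs: ["parse_logs"]
    bucket: "example-processed-archive"   # hypothetical bucket

  # ...and keep the events the transform dropped so they can be replayed.
  failed_archive:
    type: aws_s3
    inputs: ["parse_logs.dropped"]
    bucket: "example-failed-events"       # hypothetical bucket
```

In practice you would typically keep either `raw_archive` or the `processed_archive`/`failed_archive` pair, depending on which method you choose.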
Destination failures refer to the total failure of a downstream destination (for example, Elasticsearch). Data loss can be mitigated for issues with the downstream destination by using disk buffers large enough to sustain the outage time. This allows data to durably buffer while the service is down and then drain when the service comes back up. For this reason, disk buffers large enough to hold at least one hour’s worth of data are recommended. See Optimizing the Instance for more details.
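As a rough sizing sketch, the one-hour recommendation becomes a concrete buffer size once you know your throughput. The fragment below assumes a hypothetical sustained rate of 1 MB/s into the sink. The sink name, the upstream component name, and the throughput figure are illustrative, and the `max_size` (in bytes) and `when_full` options assume Vector-style buffer settings, so verify them against the reference for your Worker version.

```yaml
sinks:
  elasticsearch_out:             # hypothetical sink name; connection settings omitted
    type: elasticsearch
    inputs: ["parse_logs"]       # hypothetical upstream component
    acknowledgments: true
    buffer:
      type: "disk"
      # 1,000,000 bytes/s x 3,600 s = 3,600,000,000 bytes (about 3.6 GB) for one hour
      max_size: 3600000000
      when_full: "block"         # apply backpressure instead of dropping data when full
```

Measure your own peak throughput and scale `max_size` to match. The disk that backs the buffer needs at least that much free space.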