Monitors are essential for keeping businesses and systems running smoothly. When a monitor alerts, it signals that attention is needed. However, detecting an issue is only the tip of the iceberg; the notification is what most affects resolution time.
Notification messages bridge the gap between your monitoring system and problem solvers. Unclear or poorly written messages can cause confusion, slow response times, or leave issues unresolved, whereas a clear and actionable message helps your team quickly understand what’s wrong and what to do next.
Use this guide to improve your notification messages and learn about:
From product managers to developers, this resource ensures notifications enhance system reliability and team efficiency.
The first step is to configure the notification with the required fields:
Craft the Monitor Name to include key information for the responder to quickly understand the alert context. The monitor title should give a clear and concise description of the signal, including:
Needs Revision | Improved Title |
---|---|
Memory usage | High memory usage on {{pod_name.name}} |
While both examples refer to a memory consumption monitor, the improved title names the affected pod, which gives responders the context they need for a focused investigation.
On-call responders rely on the notification body to understand and act on alerts. Write concise, accurate, and legible messages for clarity.
Read the following sections to explore advanced features that can further enhance your monitor messages.
Monitor message variables are dynamic placeholders that allow you to customize notification messages with real-time contextual information. Use variables to make messages clearer and to provide detailed context. There are two types of variables:
Variable Type | Description |
---|---|
Conditional | Uses “if-else” logic to adjust the message context based on conditions like monitor state. |
Template | Enriches monitor notifications with contextual information. |
Variables are especially important in a Multi-Alert monitor: when it triggers, you need to know which group is responsible. For example, if you monitor CPU usage by container, grouped by host, a valuable variable is {{host.name}}, which indicates the host that triggered the alert.
Conditional variables allow you to tailor the notification message by implementing branch logic based on your needs and use case. Use them to notify different people or teams depending on the group that triggered the alert.
{{#is_exact_match "role.name" "network"}}
# The content displays if the role name of the host triggering the alert is exactly `network`, and only notifies @network-team@company.com.
@network-team@company.com
{{/is_exact_match}}
You can receive a notification if the group that triggered the alert contains a specific string.
{{#is_match "datacenter.name" "us"}}
# The content displays if the region triggering the alert contains `us` (such as us1 or us3)
@us.datacenter@company.com
{{/is_match}}
For more information and examples, see the Conditional Variables documentation.
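The difference between the two conditionals above can be sketched in Python. This is only an illustration of the matching semantics described in this guide (exact equality versus substring), not Datadog's implementation; the function names mirror the template helpers, the `group` dictionary is a hypothetical set of tag values, and details such as case handling are not modeled.

```python
def is_exact_match(tag_value: str, comparison: str) -> bool:
    """Return True only when the tag value equals the comparison string."""
    return tag_value == comparison


def is_match(tag_value: str, comparison: str) -> bool:
    """Return True when the comparison string appears anywhere in the tag value."""
    return comparison in tag_value


# Hypothetical tag values for the group that triggered the alert.
group = {"role.name": "network", "datacenter.name": "us3"}

print(is_exact_match(group["role.name"], "network"))   # whole-string equality
print(is_match(group["datacenter.name"], "us"))        # substring match
print(is_exact_match(group["datacenter.name"], "us"))  # "us3" is not exactly "us"
```

In practice, `is_exact_match` suits routing on a known, fixed tag value, while `is_match` suits families of values that share a prefix or fragment, such as `us1` and `us3`.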
Add monitor template variables to include the data that caused your monitor to alert, such as {{value}}, along with information about the context of the alert.
For example, to include the hostname, IP address, and value of the monitor query:
The CPU for {{host.name}} (IP:{{host.ip}}) reached a critical value of {{value}}.
For the list of available template variables, see the documentation.
You can also use template variables to create dynamic links and handles that automatically route your notifications.
Example of handles:
@slack-{{service.name}} There is an ongoing issue with {{service.name}}.
Results in the following when the group service:ad-server triggers:
@slack-ad-server There is an ongoing issue with ad-server.
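To make the substitution in the handle example concrete, here is a minimal Python sketch of placeholder rendering. It is a simplified stand-in, not Datadog's template engine: the `render` function and the `context` dictionary are hypothetical, and the sketch only handles plain `{{name}}` placeholders, not conditional blocks.

```python
import re


def render(template: str, context: dict[str, str]) -> str:
    """Replace each {{name}} placeholder with its value from context.

    Placeholders with no matching key are left untouched so that
    missing data stays visible in the rendered message.
    """
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        return context.get(key, match.group(0))

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", substitute, template)


template = "@slack-{{service.name}} There is an ongoing issue with {{service.name}}."

# Hypothetical context for the group service:ad-server.
print(render(template, {"service.name": "ad-server"}))
```

Because the same variable fills both the handle and the sentence, a single message definition can route to a different Slack channel for each service group.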
Example of links:
https://app.datadoghq.com/dash/integration/system_overview?tpl_var_scope=host:{{host.name}}
## What’s happening?
The CPU usage on {{host.name}} has exceeded the defined threshold.
Current CPU Usage: {{value}}
Threshold: {{threshold}}
Time: {{last_triggered_at_epoch}}
## Impact
1. Customers are experiencing lag on the website.
2. Timeouts and Errors.
## Why?
There can be several reasons as to why the CPU usage exceeded the threshold:
## How to troubleshoot/solve the issue?
1. Analyze workload to identify CPU-intensive processes.
a. for OOM - [increase pod limits if too low](<Link>)
2. Upscale {{host.name}} capacity by adding more replicas:
a. directly: <Code to do so>
b. change configuration through [add more replicas runbook](<Link>)
3. Check for any [Kafka issues](<Link>)
4. Check for any other outages/incident (attempted connections)
## Related links
* [Troubleshooting Dashboard](<Link>)
* [App Dashboard](<Link>)
* [Logs](<Link>)
* [Infrastructure](<Link>)
* [Pipeline Overview](<Link>)
* [App Documentation](<Link>)
* [Failure Modes](<Link>)