Understanding Synthetic Monitor Alerting
Synthetic Monitoring evaluates test results over time, not individual test executions.
This page explains how Datadog determines when a Synthetic Monitoring notification triggers an alert or recovers, and why alerts may behave differently than expected.
Use this page to understand:
- How alert evaluation works
- Common reasons for unexpected alert behavior

If a monitor does not alert or recovers unexpectedly, check the sections below.
How alert evaluation works
Synthetic Monitoring does not trigger alerts based on a single failed run. Instead, it continuously evaluates test results in the following order:
- The test runs based on its configured schedule.
- Fast retries are applied, if configured.
- Test results are aggregated across locations.
- Failures are evaluated over time using the alerting rules.
- The monitor transitions between OK, Alert, or No Data status as the alerting conditions are met or no longer met.
Test runs that generate alerts
| Test run type | Evaluated for alerting |
|---|---|
| Scheduled runs | Yes |
| CI/CD-triggered runs | No |
| Manually triggered runs (unpaused test) | Yes, if state changes |
| Manually triggered runs (paused test) | No |
Fast retries
Fast retries automatically re-run failed test executions. A test configured with n retries can execute up to n + 1 times per scheduled run (including the original attempt).
If you have a minimum duration configured as an alerting rule, the timer starts when the final fast retry execution fails. Fast retry runs appear in test results with a (fast retry) label in the Run Type column.
Alerting rules
Alerting rules define when a monitor is allowed to change state based on test failures over time. When fast retries are enabled, the monitor waits until all retry attempts are finished before it marks a test run as failed or triggers alert evaluations. An alert triggers only when all alerting conditions are met continuously for the configured duration.
Alerting rules typically include:
- Minimum duration (alerting delay): How long failures must persist before triggering an alert.
- Location scope: For example, any 1 of N locations, or all locations.
If any part of the alerting rule stops being true during the evaluation window, the minimum duration timer resets.
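For illustration, these rules map to fields in the Synthetic test options. The sketch below expresses them as a plain Python dictionary; the field names (`tick_every`, `retry`, `min_failure_duration`, `min_location_failed`) are assumptions based on the public Synthetics API and should be confirmed against the current API reference before use.

```python
# Illustrative sketch only: alerting-related options for a Synthetic test,
# written as a plain dictionary. Field names are assumed from the public
# Synthetics API and should be verified before use.
synthetics_test_options = {
    "tick_every": 300,            # test frequency: run every 5 minutes
    "retry": {
        "count": 2,               # fast retries: up to 2 extra executions per failed run
        "interval": 60000,        # wait 60 seconds (in milliseconds) between retries
    },
    "min_failure_duration": 300,  # minimum duration: fail continuously for 5 minutes
    "min_location_failed": 1,     # location scope: alert when any 1 location fails
}
```

With these values, a single transient failure is usually absorbed by the fast retries rather than immediately driving the monitor into alert.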
Test frequency and minimum duration
Test frequency and minimum duration work together to determine when a monitor can alert. These two settings are commonly confused because they both affect alert timing, but they serve different purposes:
- Test frequency: How often the test runs. This determines how soon failures can be detected and how frequently the alerting rules are evaluated.
- Minimum duration: How long the test must continuously fail before alerting. This prevents alerts from triggering on brief, transient issues.
Note: If you have fast retries enabled, the minimum duration timer starts when the final fast retry test execution fails.
Understanding how these settings interact helps explain why alerts may take longer to trigger than expected, especially when minimum duration exceeds test frequency.
Example: Minimum duration without fast retries
- Fast retries: not configured
- Test frequency: 15 minutes
- Minimum duration: 13 minutes
- Location scope: 1 of 1
With the above settings, the alert triggers 13 minutes after the scheduled test runs have failed:
| Time | Event | Result | Monitor status |
|---|---|---|---|
| t0 | Scheduled test runs | Pass | OK |
| t15 | Scheduled test runs | Fail | OK (Minimum duration timer starts) |
| t28 | N/A | Fail | ALERT (13 minutes elapsed) |
Note: This configuration is not recommended because it lacks fast retries and alerts on a single failure, which can lead to false positives from transient issues. Instead, consider shortening the test frequency to 5 minutes and/or enabling fast retries. This approach allows additional test executions to run during transient issues, reducing false positives while still ensuring timely alerts for real, persistent problems.
Example: Fast retries causing a delay in alerting
- Fast retries: 2 retries, with 1 minute between retries
- Test frequency: 30 minutes
- Minimum duration: 5 minutes
- Location scope: 1 of 1
With the above settings, the minimum duration timer starts when the second fast retry fails:
| Time | Event | Result | Monitor status |
|---|---|---|---|
| t0 | Scheduled test runs | Pass | OK |
| t30 | Scheduled test runs | Fail | OK |
| t31 | First fast retry for scheduled test run at t30 | Fail | OK |
| t32 | Second fast retry for scheduled test run at t30 | Fail | OK (Minimum duration timer starts) |
| t37 | N/A | Fail | ALERT (5 minutes elapsed) |
| t60 | Scheduled test runs | Pass | OK |
Note: Because fast retries were configured, the alert triggered at t37 instead of t35, adding a 2-minute delay.
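Both examples reduce to the same arithmetic: the minimum duration timer starts when the final fast retry of a failing scheduled run fails, and the alert triggers once the timer reaches the minimum duration. The following is a minimal sketch of that calculation, assuming failures persist and ignoring the time each execution takes:

```python
def first_alert_minute(first_failing_run: float, retries: int,
                       retry_interval_min: float, min_duration_min: float) -> float:
    """Approximate the minute at which the monitor first alerts, assuming the
    test keeps failing and each execution's own duration is negligible."""
    # The minimum duration timer starts when the final fast retry fails.
    timer_start = first_failing_run + retries * retry_interval_min
    return timer_start + min_duration_min

# Example 1: no fast retries, 13-minute minimum duration, first failure at t15.
print(first_alert_minute(15, retries=0, retry_interval_min=0, min_duration_min=13))  # 28.0 -> t28

# Example 2: 2 fast retries 1 minute apart, 5-minute minimum duration, first failure at t30.
print(first_alert_minute(30, retries=2, retry_interval_min=1, min_duration_min=5))   # 37.0 -> t37
```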
Best practices
- If you need immediate alerting, set the minimum duration to 0 to alert as soon as a failure occurs. However, this approach does not allow additional test executions during transient issues, which can lead to false positives. Instead, enable fast retries to handle transient issues like network blips. For frequently running tests, pair fast retries with a longer minimum duration to reduce alert noise.
- Avoid overlapping fast retries with scheduled test runs so you can determine which fast retries are associated with their related scheduled test runs.
Location-based evaluation
Location rules determine how many locations must fail for an alert to trigger.
Common patterns include:
- Fail from any 1 of N locations
- Fail from all locations
- All locations failing at the same point in time
A monitor can recover even if some locations are still failing, as long as the configured alerting rules are no longer satisfied during the evaluation window.
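As a concrete illustration, a "fail from at least X of N locations" rule can be expressed as a simple count over the latest per-location results. The function and data shape below are illustrative assumptions, not Datadog internals:

```python
# Illustrative only: check whether the latest per-location results satisfy
# an "at least X of N locations failing" rule.
def location_rule_met(passed_by_location: dict[str, bool], min_locations_failed: int) -> bool:
    """passed_by_location maps each location to True if its latest run passed."""
    failed = sum(1 for passed in passed_by_location.values() if not passed)
    return failed >= min_locations_failed

latest = {"aws:us-east-1": False, "aws:eu-west-1": True, "aws:ap-northeast-1": True}

print(location_rule_met(latest, min_locations_failed=1))  # True: any 1 of 3 locations failing
print(location_rule_met(latest, min_locations_failed=3))  # False: not all locations failing
```

This also shows why a monitor using an "all locations" rule can recover while one location is still failing: the rule stops being satisfied as soon as any location passes.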
Alert and recovery behavior
A recovery does not require all test runs to pass, only that the alerting conditions are no longer true.
- Alert notifications are sent when alerting rules are met.
- Recovery notifications are sent when alerting rules are no longer met.
Global uptime and alert state
Global uptime represents the percentage of time your monitor was healthy (OK status) during the selected time period.
It is based on how long the monitor stayed in an OK state compared to the total monitoring period. Any time the monitor spends in an ALERT state lowers the global uptime.
Because this metric is based on the duration of the monitor’s status and not on the status of a test execution, it cannot be reliably calculated based on the ratio of successful test results to the total number of test executions over the same period.
Depending on the test frequency and alerting configuration, the ratio of successful test results may still roughly approximate global uptime. For example, for a test that runs every minute with a minimum duration of 0, the two values are usually close.
The formula for calculating global uptime is:
Global Uptime = ((Total Period - Time in Alert) / Total Period) × 100
Example calculation
The following example demonstrates how a 95.83% global uptime is calculated.
1. Identify the monitoring period.
   The monitor is scoped to Jan 12, 10:56 AM - Jan 12, 4:56 PM, a 360-minute period.
2. Determine the time spent in alert status.
   Zoom into the time range to identify when the monitor was in an alert state. The alert period is Jan 12, 3:46 PM - Jan 12, 4:01 PM, approximately 15 minutes.
3. Apply the formula.
   Total Period = 360 minutes
   Time in Alert = 15 minutes
   Global Uptime = ((360 - 15) / 360) × 100 = 95.83%
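The same calculation as a short sketch (a hypothetical helper, not part of any Datadog library):

```python
def global_uptime(total_period_min: float, time_in_alert_min: float) -> float:
    """Global uptime = ((total period - time in alert) / total period) x 100."""
    return (total_period_min - time_in_alert_min) / total_period_min * 100

print(round(global_uptime(total_period_min=360, time_in_alert_min=15), 2))  # 95.83
```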
Monitor status reference
- OK
- The monitor is healthy. Either all test runs are passing, or failures have not met the alerting conditions (minimum duration and location requirements).
- ALERT
- The alerting conditions have been met. The test has been failing continuously for the configured minimum duration across the required number of locations.
- NO DATA
- The monitor has not received any test results from any location (managed, private, or Datadog Agent) during the queried time period. Common causes include:
- The test is paused: Paused tests do not execute and produce no data.
- Advanced schedule configuration: The queried time period falls outside the test’s configured schedule windows.
- Delay in test execution: The test has not yet run during the selected time period. This typically occurs with overloaded private locations, which can cause intermittent timeouts, missed runs, gaps in the test schedule, or a private location that has stopped reporting.
When these symptoms are present, the private location has more tests assigned to it than it can handle. You can resolve this by adding workers, increasing concurrency, or adding compute resources. See Dimensioning Private Locations for more information.
- Delay in data ingestion: Test results have not yet been processed and are not available for the queried time period.
Further Reading