Anomaly detection is an algorithmic feature that identifies when a metric is behaving differently than it has in the past, taking into account trends, seasonal day-of-week, and time-of-day patterns. It is suited for metrics with strong trends and recurring patterns that are hard to monitor with threshold-based alerting.
For example, anomaly detection can help you discover when your web traffic is unusually low on a weekday afternoon—even though that same level of traffic is normal later in the evening. Or consider a metric measuring the number of logins to your steadily-growing site. Because the number increases daily, any threshold would be outdated, whereas anomaly detection can alert you if there is an unexpected drop—potentially indicating an issue with the login system.
To create an anomaly monitor in Datadog, use the main navigation: Monitors -> New Monitor -> Anomaly.
Any metric reporting to Datadog is available for monitors. For more information, see the Metric Monitor page.
Note: The anomalies function uses the past to predict what is expected in the future, so using it on a new metric may yield poor results.
After defining the metric, the anomaly detection monitor provides two preview graphs in the editor.
Trigger an alert if the values have been above or below, above, or below the bounds for the last 15 minutes, 1 hour, etc., or select custom to set a value between 15 minutes and 24 hours. Recover if the values are within the bounds for at least 15 minutes, 1 hour, etc., or select custom to set a value between 15 minutes and 24 hours.
With the default setting (above or below), a metric is considered to be anomalous if it is outside of the gray anomaly band. Optionally, you can specify whether only being above or only being below the bands is considered anomalous.

Note: The range of accepted values for the Recovery Window depends on the Trigger Window and the Alert Threshold, to ensure the monitor cannot satisfy both the recovery and the alert condition at the same time. Example:
Threshold: 50%
Trigger window: 4h

The range of accepted values for the recovery window is between 121 minutes (4h*(1-0.5) + 1 min = 121 minutes) and 4 hours. Setting a recovery window below 121 minutes could lead to a 4 hour timeframe containing both 50% anomalous points and a final 120 minutes with no anomalous points.

Another example:
Threshold: 80%
Trigger window: 4h

The range of accepted values for the recovery window is between 49 minutes (4h*(1-0.8) + 1 min = 49 minutes) and 4 hours.

Datadog automatically analyzes your chosen metric and sets several parameters for you. However, these options are available for you to edit under Advanced Options.
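The recovery-window arithmetic in the two examples above (minimum recovery window = trigger window × (1 − threshold) + 1 minute) can be sketched as follows; the helper name is illustrative, not part of any Datadog tooling:

```python
def min_recovery_window_minutes(trigger_window_minutes: int, threshold: float) -> int:
    """Smallest recovery window (in minutes) that cannot be satisfied
    at the same time as the alert condition."""
    return round(trigger_window_minutes * (1 - threshold)) + 1

# 4h trigger window with a 50% threshold -> 121 minutes
print(min_recovery_window_minutes(240, 0.5))
# 4h trigger window with an 80% threshold -> 49 minutes
print(min_recovery_window_minutes(240, 0.8))
```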
Under Advanced Options, you can edit:

- The anomaly detection algorithm (basic, agile, or robust).
- The seasonality (hourly, daily, or weekly) of the cycle for the agile or robust algorithm to analyze the metric.
- The time zone for agile or robust anomaly detection with weekly or daily seasonality. For more information, see Anomaly Detection and Time Zones.

Required data history for the Anomaly Detection algorithm: machine learning algorithms require at least three times as much historical data as the chosen seasonality period to compute the baseline. For example, weekly seasonality requires at least three weeks of history.
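The three-times rule can be sketched numerically (names and unit choices are ours, for illustration only):

```python
# Seasonality period, in hours, for each supported setting.
SEASONALITY_HOURS = {"hourly": 1, "daily": 24, "weekly": 24 * 7}

def min_history_hours(seasonality: str) -> int:
    """Minimum hours of history before the seasonal algorithms
    can compute a baseline: three times the seasonality period."""
    return 3 * SEASONALITY_HOURS[seasonality]

print(min_history_hours("weekly"))  # 504 hours, i.e. three weeks
```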
All of the seasonal algorithms may use up to six weeks of historical data when calculating a metric’s expected normal range of behavior. By using a significant amount of past data, the algorithms avoid giving too much weight to abnormal behavior that might have occurred in the recent past.
The graphs below illustrate how and when these three algorithms behave differently from one another.
In this example, basic successfully identifies anomalies that spike out of the normal range of values, but it does not incorporate the repeating, seasonal pattern into its predicted range of values. By contrast, robust and agile both recognize the seasonal pattern and can detect more nuanced anomalies, for example if the metric were to flat-line near its minimum value. The trend also shows an hourly pattern, so hourly seasonality works best in this case.
In this example, the metric exhibits a sudden level shift. Agile adjusts more quickly to the level shift than robust. Also, the width of robust's bounds increases to reflect greater uncertainty after the level shift; the width of agile's bounds remains unchanged. Basic is clearly a poor fit for this scenario, where the metric exhibits a strong weekly seasonal pattern.
This example shows how the algorithms react to an hour-long anomaly. Robust does not adjust the bounds for the anomaly in this scenario, since it reacts more slowly to abrupt changes. The other algorithms start to behave as if the anomaly is the new normal. Agile even identifies the metric's return to its original level as an anomaly.
The algorithms deal with scale differently. Basic and robust are scale-insensitive, while agile is not. The graphs on the left below show agile and robust mark the level shift as anomalous. On the right, 1000 is added to the same metric, and agile no longer calls out the level shift as anomalous, whereas robust continues to do so.
This example shows how each algorithm handles a new metric. Robust and agile do not show any bounds during the first few seasons (weekly). Basic starts showing bounds shortly after the metric first appears.
For detailed instructions on the advanced alert options (auto resolve, evaluation delay, etc.), see the Monitor configuration page. For the metric-specific option full data window, see the Metric monitor page.
For detailed instructions on the Configure notifications and automations section, see the Notifications page.
Customers on an enterprise plan can create anomaly detection monitors using the create-monitor API endpoint. Datadog strongly recommends exporting a monitor’s JSON to build the query for the API. By using the monitor creation page in Datadog, customers benefit from the preview graph and automatic parameter tuning to help avoid a poorly configured monitor.
Note: Anomaly detection monitors are only available to customers on an enterprise plan. Customers on a pro plan interested in anomaly detection monitors should reach out to their customer success representative or email the Datadog billing team.
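As a sketch, such a monitor is created by POSTing a JSON payload to the create-monitor endpoint; the query string, name, message, and handle below are illustrative placeholders, not values from this document:

```python
import json

# Illustrative payload for the create-monitor endpoint
# (POST https://api.datadoghq.com/api/v1/monitor, authenticated with
# DD-API-KEY and DD-APPLICATION-KEY headers).
payload = {
    "type": "query alert",
    "name": "Anomalous CPU on Cassandra nodes",  # placeholder name
    "query": (
        "avg(last_1h):anomalies(avg:system.cpu.system{name:cassandra}, "
        "'basic', 3, direction='above', alert_window='last_5m', "
        "interval=20, count_default_zero='true') >= 1"
    ),
    "message": "CPU is behaving anomalously.",  # placeholder message
    "options": {
        "thresholds": {"critical": 1, "critical_recovery": 0},
        # trigger_window must match the alert_window in the query
        "threshold_windows": {
            "trigger_window": "last_5m",
            "recovery_window": "last_5m",
        },
    },
}
print(json.dumps(payload, indent=2))
```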
Anomaly monitors are managed using the same API as other monitors. These fields are unique for anomaly monitors:
query
The query property in the request body should contain a query string in the following format:
avg(<query_window>):anomalies(<metric_query>, '<algorithm>', <deviations>, direction='<direction>', alert_window='<alert_window>', interval=<interval>, count_default_zero='<count_default_zero>' [, seasonality='<seasonality>']) >= <threshold>
- query_window: last_4h or last_7d. The time window displayed in graphs in notifications. Must be at least as large as the alert_window, and is recommended to be around 5 times the alert_window.
- metric_query: The metric query to alert on (for example, sum:trace.flask.request.hits{service:web-app}.as_count()).
- algorithm: basic, agile, or robust.
- deviations: The number of deviations used to set the width of the anomaly bounds (for example, 3).
- direction: above, below, or both.
- alert_window: The window of time evaluated for anomalies (for example, last_5m, last_1h).
- interval: A number of seconds; it should divide evenly into the alert_window duration.
- count_default_zero: true for most monitors. Set to false only if submitting a count metric in which the lack of a value should not be interpreted as a zero.
- seasonality: hourly, daily, or weekly. Exclude this parameter when using the basic algorithm.
- threshold: The fraction of points in the alert_window that must be anomalous in order for a critical alert to trigger.

Below is an example query for an anomaly detection monitor, which alerts when the average Cassandra node's CPU is three standard deviations above the ordinary value over the last 5 minutes:
avg(last_1h):anomalies(avg:system.cpu.system{name:cassandra}, 'basic', 3, direction='above', alert_window='last_5m', interval=20, count_default_zero='true') >= 1
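The query format can also be assembled programmatically. This sketch (the function name is ours, not a Datadog API) reproduces the example query above:

```python
def build_anomaly_query(query_window, metric_query, algorithm, deviations,
                        direction, alert_window, interval,
                        count_default_zero, threshold, seasonality=None):
    """Assemble an anomaly monitor query string in the documented format."""
    args = [
        metric_query,
        f"'{algorithm}'",
        str(deviations),
        f"direction='{direction}'",
        f"alert_window='{alert_window}'",
        f"interval={interval}",
        f"count_default_zero='{count_default_zero}'",
    ]
    if seasonality is not None:  # omitted for the basic algorithm
        args.append(f"seasonality='{seasonality}'")
    return f"avg({query_window}):anomalies({', '.join(args)}) >= {threshold}"

print(build_anomaly_query("last_1h", "avg:system.cpu.system{name:cassandra}",
                          "basic", 3, "above", "last_5m", 20, "true", 1))
```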
options

Most of the properties under options in the request body are the same as for other query alerts, except for thresholds and threshold_windows.
thresholds

The thresholds property includes the critical, critical_recovery, warning, and warning_recovery thresholds. Thresholds are expressed as numbers from 0 to 1, and are interpreted as the fraction of the associated window that is anomalous. For example, a critical threshold value of 0.9 means that a critical alert triggers when at least 90% of the points in the trigger_window (described below) are anomalous. A warning_recovery value of 0 means that the monitor recovers from the warning state only when 0% of the points in the recovery_window are anomalous. The critical threshold should match the threshold used in the query.

threshold_windows

Anomaly monitors include a threshold_windows property in options. threshold_windows must include two properties: trigger_window and recovery_window. These windows are expressed as timeframe strings, such as last_10m or last_1h. The trigger_window must match the alert_window from the query. The trigger_window is the time range analyzed for anomalies when evaluating whether a monitor should trigger. The recovery_window is the time range analyzed for anomalies when evaluating whether a triggered monitor should recover.

A standard configuration of thresholds and threshold windows looks like:
"options": {
...
"thresholds": {
"critical": 1,
"critical_recovery": 0
},
"threshold_windows": {
"trigger_window": "last_30m",
"recovery_window": "last_30m"
}
}
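A minimal consistency check over such an options object, following the constraints described above (the helper name is ours, for illustration):

```python
def check_anomaly_options(options, alert_window):
    """Verify the anomaly-specific constraints on a monitor's options."""
    tw = options["threshold_windows"]
    # trigger_window must match the alert_window from the query
    assert tw["trigger_window"] == alert_window
    assert "recovery_window" in tw
    # thresholds are fractions of the associated window, between 0 and 1
    for name, value in options["thresholds"].items():
        assert 0 <= value <= 1, f"{name} out of range"
    return True

options = {
    "thresholds": {"critical": 1, "critical_recovery": 0},
    "threshold_windows": {"trigger_window": "last_30m",
                          "recovery_window": "last_30m"},
}
print(check_anomaly_options(options, "last_30m"))  # True
```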