Outlier detection is an algorithmic feature that allows you to detect when a specific group is behaving different compared to its peers. For example, you could detect that one web server in a pool is processing an unusual number of requests, or significantly more 500 errors are happening in one AWS availability zone than the others.
To create an outlier monitor in Datadog, use the main navigation: Monitors –> New Monitor –> Outlier.
Any metric currently reporting to Datadog is available for monitors. For more information, see the Metric Monitor page.
The outlier monitor requires a metric with a group (hosts, availability zones, partitions, etc.) that has three or more members, which exhibit uniform behavior.
1 hour, etc. or
customto set a value between 1 minute and 24 hours.
30, etc. (only for
When setting up an outlier monitor, the time window is an important consideration. If the time window is too large, you might not be alerted in time. If the time window is too short, the alerts are not as resilient to one-off spikes.
To ensure your alert is properly calibrated, set the time window in the preview graph and use the reverse («) button to look back in time at outliers that would have triggered an alert. Additionally, you can use this feature to tune your parameters to a specific outlier algorithm.
Datadog offers two types of outlier detection algorithms:
scaledMAD. It is recommended to use the default algorithm, DBSCAN. If you have trouble detecting the correct outliers, adjust the parameters of DBSCAN or try the MAD algorithm. The scaled algorithms may be useful if your metrics are large scale and closely clustered.
DBSCAN (density-based spatial clustering of applications with noise) is a popular clustering algorithm. Traditionally, DBSCAN takes:
𝜀that specifies a distance threshold under which two points are considered to be close.
𝜀-radiusbefore that point can start agglomerating.
Datadog uses a simplified form of DBSCAN to detect outliers on timeseries. Each group is considered to be a point in d-dimensions, where d is the number of elements in the timeseries. Any point can agglomerate, and any point not in the largest cluster is considered an outlier. The initial distance threshold is set by creating a new median timeseries by taking the median of the values from the existing timeseries at every time point. The Euclidean distance between each group and the median series is calculated. The threshold is set as the median of these distances, multiplied by a normalizing constant.
This implementation of DBSCAN takes one parameter,
tolerance, the constant by which the initial threshold is multiplied to yield DBSCAN’s distance parameter 𝜀. Set the tolerance parameter according to how similarly you expect your groups to behave—larger values allow for more tolerance in how much a group can deviate from its peers.
MAD (median absolute deviation) is a robust measure of variability, and can be viewed as the robust analog for standard deviation. Robust statistics describe data in a way that is not influenced by outliers.
To use MAD for your outlier monitor, configure the parameters
Tolerance specifies the number of deviations a point needs to be away from the median for it to be considered an outlier. This parameter should be tuned depending on the expected variability of the data. For example, if the data is generally within a small range of values, then this should be small. Otherwise, if points can vary greatly, then set a higher scale so the variabilities do not trigger false positives.
Percent refers to the percentage of points in the series considered as outliers. If this percentage is exceeded, the whole series is marked as an outlier.
DBSCAN and MAD have scaled versions (scaledDBSCAN and scaledMAD). In most situations, the scaled algorithms behave the same as their regular counterparts. However, if DBSCAN/MAD algorithms are identifying outliers within a closely clustered group of metrics, and you would like the outlier detection algorithm to scale with the overall magnitude of the metrics, try the scaled algorithms.
So which algorithm should you use? For most outliers, any algorithm performs well at the default settings. However, there are subtle cases where one algorithm is more appropriate.
In the following image, a group of hosts is flushing their buffers together, while one host is flushing its buffer slightly later. DBSCAN picks this up as an outlier whereas MAD does not. This is a case where you might prefer to use MAD since the synchronization of the group is just an artifact of the hosts being restarted at the same time. On the other hand, if instead of flushed buffers, the metrics represented a scheduled job that should be synchronized across hosts, DBSCAN would be the correct choice.
For detailed instructions on the advanced alert options (auto resolve, new group delay, etc.), see the Monitor configuration page.
For detailed instructions on the Say what’s happening and Notify your team sections, see the Notifications page.
The outlier algorithms are set up to identify groups that are behaving differently from their peers. If your groups exhibit “banding” behavior as shown below (maybe each band represents a different shard), Datadog recommends tagging each band with an identifier, and setting up outlier detection alerts on each band separately.
Additional helpful documentation, links, and articles: