Outlier Detection is an algorithmic feature that allows you to detect when some members of a group are behaving strangely compared to the others. For example, you could detect that one web server in a pool is processing an unusual number of requests, and hence should be a target for replacement. Or, you could get an early warning that significantly more 500s are happening in one AWS Availability Zone (AZ) than the others, which might indicate an issue brewing in that AZ.
The outliers query function, when applied to your query, returns the usual results with any outlier series marked.
You can use this function to display and alert on outliers in your data. To try it out, you first need a metric for which a group of hosts (or availability zones, partitions, etc.) should exhibit uniform behavior. For the function to work, make sure the group has at least three members. Given that, here are two ways to use outlier detection on that group.
Here’s a graph of Gunicorn requests by host with outlier detection enabled.
You can see that one of the series is an outlier: it is handling significantly lower traffic than the others for the time window in question.
To set up an outlier detection graph for your data, add a metric to the graph showing all series in the group. Then apply the outlier detection algorithm by adding the outliers function to your data. After applying the function, any outlier series is colored with a bold, warm palette, while all other series are colored with a lightweight, greyscale palette.
First create a new timeseries graph on your dashboard with your chosen metric.
To enable outlier detection, click the + icon on the right side of the metrics line. Choose Algorithms from the function categories, then pick one of the four outlier algorithms.
This applies the outliers function to your graph, and you’ll see any outliers in the group highlighted in bold, warm colors.
There are several outlier detection algorithms to choose from. The default algorithm (DBSCAN) and parameter values should work for most scenarios. However, if too many or too few outliers are identified, you can tune the algorithm or try an alternate algorithm. To learn more, see the “Outlier Algorithms and Parameters” section below.
You can also define a monitor to alert when an outlier is detected in an important group.
For example, to alert when a Cassandra host is abnormally loaded compared to the rest of the group, you can add a new outlier monitor for the metric.
Navigate to the New Monitor page and click Outlier. Then fill out the Define the metric section just as you would for any other monitor.
In the alert conditions, select the grouping and timeframe. Then select an algorithm and parameter values to use for outlier detection.
To ensure that your alert is properly calibrated, you can set the time window at the top of the screen and use the reverse (<<) button to look back in time and see when outliers would have been found and alerted on. This is also a good way to tune the parameters of the specific outlier algorithm you’re using.
There are two types of outlier detection algorithms you can use on your data: DBSCAN/ScaledDBSCAN and MAD/ScaledMAD. Datadog recommends starting with the default algorithm, DBSCAN. If you have trouble detecting the right outliers, you can adjust DBSCAN’s parameters or try the alternate algorithm, MAD. If you have metrics on a larger scale that appear closely clustered but the DBSCAN/MAD algorithms are identifying some as outliers, try the scaled algorithms. The Datadog blog post on outlier detection has more detailed information.
DBSCAN (density-based spatial clustering of applications with noise) is a popular clustering algorithm. Traditionally, DBSCAN takes: 1) a parameter 𝜀 that specifies a distance threshold under which two points are considered to be close; and 2) the minimum number of points that have to be within a point’s 𝜀-radius before that point can start agglomerating.
Datadog uses a simplified form of DBSCAN to detect outliers on timeseries. Datadog considers each host to be a point in d dimensions, where d is the number of elements in the timeseries. Any point can agglomerate, and any point not in the largest cluster is considered an outlier. To set the initial distance threshold, a median timeseries is first created by taking the median of the values from the existing timeseries at every time point. The Euclidean distance between each host and the median series is then calculated. The threshold is set as the median of these distances, multiplied by a normalizing constant.
This implementation of DBSCAN takes one parameter, tolerance, the constant by which the initial threshold is multiplied to yield DBSCAN’s distance parameter 𝜀.
Here is DBSCAN with a tolerance of 3.0 in action on a pool of Cassandra workers:
Set the tolerance parameter according to how similarly you expect your hosts to behave—larger values allow for more tolerance in how much a host can deviate from its peers.
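The steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Datadog's implementation: the function name, the input shape (a dict of equal-length series keyed by host), and the normalizing_constant default of 1.0 are all hypothetical, since the actual constant is not given here.

```python
import math
from statistics import median
from typing import Dict, List


def simplified_dbscan_outliers(
    series_by_host: Dict[str, List[float]],
    tolerance: float = 3.0,
    normalizing_constant: float = 1.0,  # assumption: the real value is not documented here
) -> List[str]:
    """Return hosts that fall outside the largest cluster."""
    hosts = list(series_by_host)
    points = [series_by_host[h] for h in hosts]

    # Median series: the median across hosts at every time point.
    median_series = [median(values) for values in zip(*points)]

    # Initial threshold: median distance to the median series, normalized.
    distances = [math.dist(p, median_series) for p in points]
    epsilon = tolerance * normalizing_constant * median(distances)

    # Any point can agglomerate, so clusters are simply the connected
    # components of the "within epsilon of each other" graph.
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in unvisited
                    if math.dist(points[i], points[j]) <= epsilon}
            unvisited -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(cluster)

    # Every point outside the largest cluster is an outlier.
    largest = max(clusters, key=len)
    return sorted(hosts[i] for i in range(len(points)) if i not in largest)


pool = {
    "web-1": [100, 102, 101, 99],
    "web-2": [101, 100, 102, 100],
    "web-3": [99, 101, 100, 101],
    "web-4": [40, 42, 41, 39],  # handling far less traffic than its peers
}
print(simplified_dbscan_outliers(pool, tolerance=3.0))  # ['web-4']
```

Raising tolerance widens 𝜀, so hosts have to drift further from their peers before they drop out of the largest cluster.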
The Median Absolute Deviation (MAD) is a robust measure of variability, and can be viewed as the robust analog for standard deviation. Robust statistics describe data in such a way that they are not unduly influenced by outliers.
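To make the robustness claim concrete, here is a small sketch comparing MAD with the standard deviation on the same data before and after a single extreme value is introduced. The helper name and the sample values are illustrative only.

```python
from statistics import median, pstdev
from typing import List


def median_absolute_deviation(values: List[float]) -> float:
    """Median of the absolute deviations from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


clean = [10, 11, 9, 10, 12, 10, 11]
spiked = [10, 11, 9, 10, 12, 10, 500]  # one extreme point replaces the last value

# MAD is unchanged by the single spike; the standard deviation is not.
print(median_absolute_deviation(clean), median_absolute_deviation(spiked))
print(pstdev(clean), pstdev(spiked))
```

Because MAD depends only on the middle of the sorted deviations, one wild point cannot inflate it, whereas the standard deviation grows by orders of magnitude.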
To use MAD for your outlier monitor, configure two parameters:
tolerance: specifies how many “deviations” a point has to be away from the median for it to be considered an outlier.
pct: if more than this percentage of a particular series’ points are considered outliers, then the whole series is marked as an outlier.
Here is MAD with a tolerance of 3 and pct of 20 in action when comparing the average system load by availability zone:
The tolerance parameter should be tuned depending on the expected variability of the data. For example, if the data is generally within a small range of values, then the tolerance should be small. On the other hand, if points can vary greatly, then you want a higher tolerance so that this variability does not trigger false positives.
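The interplay of tolerance and pct can be sketched as follows. This is a hedged illustration, not Datadog's implementation: the function name and input shape are hypothetical, and it assumes the median and MAD are computed over all points of all series in the window, which is not confirmed by the description above.

```python
from statistics import median
from typing import Dict, List


def mad_outlier_series(
    series_by_group: Dict[str, List[float]],
    tolerance: float = 3.0,
    pct: float = 20.0,
) -> List[str]:
    """Return the series in which more than pct percent of points are outliers."""
    # Assumption: pool every point in the window to get one median and one MAD.
    all_points = [v for series in series_by_group.values() for v in series]
    center = median(all_points)
    mad = median([abs(v - center) for v in all_points])

    flagged = []
    for name, series in series_by_group.items():
        # A point is an outlier if it sits more than `tolerance` deviations
        # away from the median.
        outlying = sum(1 for v in series if abs(v - center) > tolerance * mad)
        if 100.0 * outlying / len(series) > pct:
            flagged.append(name)
    return flagged


load_by_az = {
    "us-east-1a": [1.0, 1.1, 0.9, 1.0, 1.2],
    "us-east-1b": [1.1, 1.0, 1.0, 0.9, 1.1],
    "us-east-1c": [1.0, 3.5, 3.8, 4.0, 1.1],  # elevated for 3 of 5 points
    "us-east-1d": [0.9, 1.0, 1.1, 1.0, 1.0],
}
print(mad_outlier_series(load_by_az, tolerance=3.0, pct=20.0))  # ['us-east-1c']
```

In this sample, 60% of the points in us-east-1c are outlying, so the series is flagged at pct of 20 but would not be at pct of 80.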
So which algorithm should you use? For most outliers, any algorithm performs well at the default settings. However, there are subtle cases where one algorithm is more appropriate.
In the following image, a group of hosts is flushing their buffers together, while one host is flushing its buffer slightly later. DBSCAN picks this up as an outlier whereas MAD does not. This is a case where you might prefer to use MAD, as you don’t care about when the buffers get flushed.
The synchronization of the group is just an artifact of the hosts being restarted at the same time. On the other hand, if instead of flushed buffers, the metrics below represented a scheduled job that actually should be synchronized across hosts, DBSCAN would be the right choice.
DBSCAN and MAD have scaled versions, called ScaledDBSCAN and ScaledMAD, respectively. In most situations, the scaled algorithms behave the same as their regular counterparts. However, if the DBSCAN/MAD algorithms are identifying outliers within a closely clustered group of metrics, and you would like the outlier detection algorithm to scale with the overall magnitude of the metrics, try the scaled algorithms.
Here is a comparison of DBSCAN and ScaledDBSCAN with tolerances of 3 on field data size in a group of Elasticsearch nodes:
Here is an example of MAD and ScaledMAD algorithms for comparing the usable memory in Cassandra hosts. Both have tolerances of 3 and pct of 20:
When setting up an outlier alert, an important parameter is the size of the time window. If the window size is too large, by the time an outlier is detected, the bad behavior might have been going on for longer than one would like. If the window size is too short, the alerts are not as resilient to unimportant, one-off spikes.
Both algorithms are set up to identify outliers that differ from the majority of metrics that are behaving similarly. If your hosts exhibit “banding” behavior as shown below (perhaps because each band represents a different shard), we recommend tagging each band with an identifier, and setting up outlier detection alerts on each band separately.