Content Anomaly

Overview

Content anomaly detection analyzes incoming logs to identify and alert on anomalous log content. You can set anomaly detection parameters to trigger signals if a log’s field values significantly deviate from historical logs within a group. A deviation is significant when the similarity between incoming and historical values is low or absent. See How logs are determined to be anomalous for more information.

See Create Rule for instructions on how to configure a content anomaly rule.

How content anomaly detection works

Anomaly detection parameters

When you create a rule with the content anomaly detection method, you can set the following parameters.

Learning duration

A content anomaly rule's query with the learning duration setting highlighted

Description: Time window when values are learned. No signals are generated during this phase. The learning period restarts if the rule is modified.
Default: 7 days
Range: 1-10 days
How to configure: When you edit a content anomaly rule, you can set the learning duration in the query’s Learning for dropdown menu.

Forget after

Content anomaly detection options with the within the last days dropdown menu highlighted

Description: How long learned values are retained before being discarded.
Default: 7 days
Range: 1-10 days
How to configure: In the Content anomaly detection options section of a rule’s setting page, you can set how long learned values are retained in the within the last dropdown menu.

Similarity percentage threshold

Content anomaly detection options with the similarity percentage dropdown menu highlighted

Description: Minimum similarity required to consider a log as normal.
Default: 70%
Range: 35-100%
How to configure: In the Content anomaly detection options section of a rule’s setting page, you can set the similarity percentage threshold in the similarity percentage dropdown menu.

Evaluation window

Content anomaly detection options with the similar items dropdown menu highlighted

Description: Defines the time frame for counting anomalous logs. A signal is triggered if the number of anomalous logs in the window meets the case condition (for example, a >= 2).
Range: 0-24 hours
How to configure: In the Set conditions section of a rule’s setting page, you can set a condition’s evaluation window in the within a window of dropdown menu.
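
To make the interaction between the evaluation window and the case condition concrete, the following is a minimal Python sketch. It is an illustration only, not Cloud SIEM's implementation: the function `condition_met` and its parameters are hypothetical names, and the sketch assumes the condition is a simple count of anomalous logs inside a trailing window.

```python
from datetime import datetime, timedelta


def condition_met(anomaly_times, now, window, min_count=2):
    """Return True if at least `min_count` anomalies fall inside the window.

    Hypothetical helper: mirrors a case condition such as `a >= 2`
    evaluated over a trailing window that ends at `now`.
    """
    window_start = now - window
    recent = [t for t in anomaly_times if window_start < t <= now]
    return len(recent) >= min_count


now = datetime(2024, 1, 1, 12, 5)
anomaly_times = [
    datetime(2024, 1, 1, 11, 0),   # outside a 10-minute window
    datetime(2024, 1, 1, 12, 0),   # inside
    datetime(2024, 1, 1, 12, 3),   # inside
]

# Two anomalies fall within the last 10 minutes, so `a >= 2` holds.
triggered = condition_met(anomaly_times, now, timedelta(minutes=10))
```

A shorter window contains fewer anomalies, so the same case condition is harder to satisfy.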

How logs are determined to be anomalous

Logs are tokenized using Unicode Text Segmentation (UTS #29).
Tokens are compared using Jaccard similarity.
Efficient comparisons are achieved with MinHash and Locality Sensitive Hashing (LSH).
A log is anomalous if it fails both the similarity percentage threshold and the similar items threshold.
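
The steps above can be sketched in Python to show how MinHash approximates Jaccard similarity without comparing full token sets. This is a simplified illustration under stated assumptions, not Cloud SIEM's implementation: the regular expression `\w+` is a rough stand-in for UTS #29 word segmentation, the helper names (`tokenize`, `minhash_signature`, `estimate_jaccard`) and the choice of 256 hash functions are hypothetical, and the LSH indexing step is omitted.

```python
import hashlib
import re


def tokenize(text):
    """Split text into a set of word tokens (rough stand-in for UTS #29)."""
    return set(re.findall(r"\w+", text))


def _seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token, varied by an integer seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=256):
    """Keep the minimum hash per seed; `tokens` must be non-empty."""
    return [min(_seeded_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


tokens_a = tokenize("User connected to abc network")
tokens_b = tokenize("User got unauthorized network access")

estimate = estimate_jaccard(minhash_signature(tokens_a), minhash_signature(tokens_b))
```

The estimate converges on the exact Jaccard similarity as the number of hash functions grows. Signatures have a fixed length regardless of log size, which is what makes comparisons against many historical logs inexpensive.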

Jaccard similarity computation examples

Cloud SIEM uses the Jaccard similarity to compare logs.

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The following are examples of how Jaccard similarity is calculated for logs with single-word fields and logs with multi-word fields.

Single-word fields

These are two example logs with single-word fields:

log1={actionType:auth, resourceType:k8s, networkType:public, userType:swe}

log2={actionType:auth, resourceType:k8s, networkType:public, userType:pm}

To calculate the Jaccard similarity between the two logs:

The intersection of log1 and log2 results in this set of words: {auth, k8s, public}.
The union of log1 and log2 results in this set of words: {auth, k8s, public, swe, pm}.
The Jaccard similarity is calculated using the number of words in the results:

$$J(\text{log1}, \text{log2}) = \frac{3}{5} = 0.6$$
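
The single-word calculation can be reproduced with a short Python sketch. It treats each field value as one token, as the example does. The `jaccard` helper is an illustrative name, not a Cloud SIEM API.

```python
def jaccard(a, b):
    """Return |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


log1 = {"actionType": "auth", "resourceType": "k8s",
        "networkType": "public", "userType": "swe"}
log2 = {"actionType": "auth", "resourceType": "k8s",
        "networkType": "public", "userType": "pm"}

# Each field value is a single word, so the values themselves are the tokens.
tokens1 = set(log1.values())
tokens2 = set(log2.values())

similarity = jaccard(tokens1, tokens2)
print(similarity)  # 0.6
```

A similarity of 0.6 is below the default similarity percentage threshold of 70%, so under the default setting these two logs would not count as similar to each other.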

Multi-word fields

These are two example logs with multi-word fields:

log1={actionDescription: "User connected to abc network"}

log2={actionDescription: "User got unauthorized network access"}