Overview

Content anomaly detection analyzes incoming logs to identify and alert on anomalous log content. You can set anomaly detection parameters to trigger signals if a log’s field values significantly deviate from historical logs within a group. A deviation is significant when the similarity between incoming and historical values is low or nonexistent. See How logs are determined to be anomalous for more information.

See Create Rule for instructions on how to configure a content anomaly rule.

How content anomaly detection works

Anomaly detection parameters

When you create a rule with the content anomaly detection method, you can set the following parameters.

Learning duration

A content anomaly rule's query with the learning duration setting highlighted
  • Description: Time window when values are learned. No signals are generated during this phase. The learning period restarts if the rule is modified.
  • Default: 7 days
  • Range: 1-10 days
  • How to configure: When you edit a content anomaly rule, you can set the learning duration in the query’s Learning for dropdown menu.

Forget after

Content anomaly detection options with the within the last days dropdown menu highlighted
  • Description: How long learned values are retained before being discarded.
  • Default: 7 days
  • Range: 1-10 days
  • How to configure: In the Content anomaly detection options section of a rule’s setting page, you can set how long learned values are retained in the within the last dropdown menu.

Similarity percentage threshold

Content anomaly detection options with the similarity percentage dropdown menu highlighted
  • Description: Minimum similarity required to consider a log as normal.
  • Default: 70%
  • Range: 35-100%
  • How to configure: In the Content anomaly detection options section of a rule’s setting page, you can set the similarity percentage threshold in the similarity percentage dropdown menu.

Similar items threshold

Content anomaly detection options with the similar items dropdown menu highlighted
  • Description: Number of matching historical logs required for an incoming value to be considered normal.
  • Default: 1
  • Range: 1-20
  • How to configure: In the Content anomaly detection options section of a rule’s setting page, you can enter the similar items threshold in the with more than field.
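
The similarity percentage threshold and the similar items threshold can be read together: an incoming value is treated as normal when enough historical values are similar enough to it. The following Python sketch illustrates that reading. The function names (`jaccard`, `is_anomalous`) are hypothetical, the token sets are assumed to be already extracted from the logs, and whether the product compares with "at least" or strictly "more than" the similar items threshold is an assumption here, not something this page confirms.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def is_anomalous(incoming: set, history: list,
                 similarity_pct: float = 70.0, similar_items: int = 1) -> bool:
    """Return True when too few historical token sets are similar enough.

    Assumption: a historical log "matches" when its similarity reaches
    similarity_pct, and the incoming log is normal when the number of
    matches reaches similar_items.
    """
    matches = sum(
        1 for past in history
        if jaccard(incoming, past) * 100 >= similarity_pct
    )
    return matches < similar_items


# Hypothetical learned values (token sets from historical logs).
history = [
    {"auth", "k8s", "public", "swe"},
    {"auth", "k8s", "public", "pm"},
]

# Identical to a learned value, so one match at 100% similarity: normal.
print(is_anomalous({"auth", "k8s", "public", "swe"}, history))
# No tokens in common with any learned value: anomalous.
print(is_anomalous({"delete", "s3", "private", "ext"}, history))
```

Raising `similarity_pct` or `similar_items` makes the check stricter, so more incoming logs are flagged.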

Evaluation window

Set conditions section with the within a window of dropdown menu highlighted
  • Description: Defines the time frame for counting anomalous logs. A signal is triggered if the number of anomalous logs in the window meets the case condition (for example, a >= 2).
  • Range: 0-24 hours
  • How to configure: In the Set conditions section of a rule’s setting page, you can set a condition’s evaluation window in the within a window of dropdown menu.
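
As a rough illustration of how an evaluation window and a case condition such as a >= 2 interact, the sketch below counts anomalous logs whose timestamps fall inside a trailing window. The `should_signal` helper is hypothetical, and treating the window as a trailing (sliding) interval is an assumption; this page does not state how windows are aligned.

```python
from datetime import datetime, timedelta


def should_signal(anomaly_times, now, window, min_count=2):
    """Return True when at least min_count anomalies fall in (now - window, now].

    Assumption: the case condition has the form "a >= min_count", where a is
    the number of anomalous logs counted inside the evaluation window.
    """
    recent = [t for t in anomaly_times if now - window < t <= now]
    return len(recent) >= min_count


now = datetime(2024, 1, 1, 12, 0)
anomaly_times = [
    now - timedelta(hours=3),     # outside a one-hour window
    now - timedelta(minutes=50),  # inside
    now - timedelta(minutes=10),  # inside
]

# Two anomalies fall within the last hour, so the condition a >= 2 is met.
print(should_signal(anomaly_times, now, timedelta(hours=1)))
# No anomalies fall within the last five minutes.
print(should_signal(anomaly_times, now, timedelta(minutes=5)))
```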

How logs are determined to be anomalous

  1. Logs are tokenized using Unicode Text Segmentation (UTS #29).
  2. Tokens are compared using Jaccard similarity.
  3. Efficient comparisons are achieved with MinHash and Locality Sensitive Hashing (LSH).
  4. A log is anomalous if it fails both the similarity percentage threshold and the similar items threshold.
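
To give a feel for step 3, the following Python sketch estimates Jaccard similarity with MinHash signatures instead of comparing full token sets. It is a minimal illustration, not the product's implementation: the regex tokenizer is a simplified stand-in for UTS #29 word segmentation, the number of hash functions (128) and the helper names are assumptions, and the LSH step (bucketing signatures so that only likely matches are compared) is omitted.

```python
import hashlib
import re

NUM_HASHES = 128  # assumed signature length; more hashes give a tighter estimate


def tokenize(text: str) -> set:
    """Simplified stand-in for UTS #29 segmentation: runs of word characters."""
    return set(re.findall(r"\w+", text))


def _hash(token: str, seed: int) -> int:
    """Seeded 64-bit hash built from BLAKE2b, using the seed as the salt."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens: set, num_hashes: int = NUM_HASHES) -> list:
    """For each seeded hash function, keep the minimum hash over all tokens.

    The token set must be non-empty.
    """
    return [min(_hash(tok, seed) for tok in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of positions where two signatures agree estimates J(A, B)."""
    agree = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return agree / len(sig_a)


sig1 = minhash_signature(tokenize("User connected to abc network"))
sig2 = minhash_signature(tokenize("User got unauthorized network access"))

# The exact Jaccard similarity of these token sets is 2 / 8 = 0.25;
# the MinHash estimate should land near that value.
print(estimate_jaccard(sig1, sig2))
```

Because signatures have a fixed length, comparing two logs costs the same regardless of how many tokens they contain, which is what makes the approach efficient at scale.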

Jaccard similarity computation examples

Cloud SIEM uses the Jaccard similarity to compare logs.

$$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$$

The following are examples of how Jaccard similarity is calculated for logs with single-word fields and logs with multi-word fields.

Single-word fields

These are two example logs with single-word fields:

log1={actionType:auth, resourceType:k8s, networkType:public, userType:swe}
log2={actionType:auth, resourceType:k8s, networkType:public, userType:pm}

To calculate the Jaccard similarity between the two logs:

  • The intersection of log1 and log2 results in this set of words: {auth, k8s, public}.
  • The union of log1 and log2 results in this set of words: {auth, k8s, public, swe, pm}.
  • The Jaccard similarity is calculated using the number of words in the results:

$$J(\text{log1},\text{log2}) = \frac{3}{5} = 0.6$$

Multi-word fields

These are two example logs with multi-word fields:

log1={actionDescription: "User connected to abc network"}
log2={actionDescription: "User got unauthorized network access"}

To calculate the Jaccard similarity between the two logs:

  • The intersection of log1 and log2 results in this set of words: {User, network}.
  • The union of log1 and log2 results in this set of words: {User, connected, to, abc, network, got, unauthorized, access}.
  • The Jaccard similarity is calculated using the number of elements in the results:

$$J(\text{log1},\text{log2}) = \frac{2}{8} = 0.25$$
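
Both worked examples can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the `tokens` and `jaccard` helpers are hypothetical, only field values (not field names) are compared, matching the examples, and splitting on word characters stands in for UTS #29 tokenization.

```python
import re


def tokens(log: dict) -> set:
    """Collect the set of words that appear in a log's field values."""
    words = set()
    for value in log.values():
        words.update(re.findall(r"\w+", str(value)))
    return words


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


single_1 = {"actionType": "auth", "resourceType": "k8s",
            "networkType": "public", "userType": "swe"}
single_2 = {"actionType": "auth", "resourceType": "k8s",
            "networkType": "public", "userType": "pm"}

multi_1 = {"actionDescription": "User connected to abc network"}
multi_2 = {"actionDescription": "User got unauthorized network access"}

print(jaccard(tokens(single_1), tokens(single_2)))  # 0.6
print(jaccard(tokens(multi_1), tokens(multi_2)))    # 0.25
```

With the default similarity percentage threshold of 70%, neither pair would count as similar: 0.6 and 0.25 are both below 0.7.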

Comparing content anomaly method with other detection methods

| Feature | Anomaly Detection | New Value Detection | Content Anomaly Detection |
|---|---|---|---|
| Detects new field values | No | Yes | Yes (configurable) |
| Detects rare field values | No | No | Yes |
| Detects dissimilar values | No | No | Yes |
| Detects log spikes | Yes | No | No |
| Multiple queries supported | No | No | Yes |
| Multiple cases supported | No | No | Yes |
| Threshold definition for triggering signals | Learned from the log count distribution per time bucket (~99th percentile). | Always triggers a signal on the first occurrence of a new value. | User-specified (1-100) |
| Evaluation window | Yes | No | Yes |
| Retention | 14 days | 30 days | 10 days |