Regular Expression Syntax

Overview

In Sensitive Data Scanner, the scanning rule determines what sensitive information to match within the data. You can use rules from the Scanning Rule Library or you can create custom scanning rules using regular expression (regex) patterns to scan for sensitive information. The Sensitive Data Scanner regex syntax is a subset of PCRE2.


Alternation

Use alternation to choose the first expression that matches. An expression in an alternation can be left empty, which means it matches the empty string and makes the entire alternation optional.

Regex Syntax    Description
...|...|...     An alternation.
...|...|        An alternation with an empty expression.
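
As a rough illustration of both forms, the sketch below uses Python's re module, which is not the Sensitive Data Scanner engine but treats alternation the same way. The pattern and sample strings are hypothetical.

```python
import re

# Hypothetical pattern: "token" optionally followed by "_id" or "-id".
# The trailing empty branch makes the whole alternation optional.
pattern = re.compile(r"token(?:_id|-id|)")

samples = ["token_id", "token-id", "token", "tokenid"]
results = {s: pattern.fullmatch(s) is not None for s in samples}

# Branches are tried left to right, so the first one that matches wins.
first = re.search(r"a|ab", "ab").group()

print(results)
print(first)  # "a", not "ab"
```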

Assertions

Assertion    Description
\b           A word boundary.
\B           Not a word boundary.
^            Start of a line.
$            End of a line.
\A           Start of text.
\z           End of text.
\Z           End of text (or before a \n that is immediately before the end of the text).
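
The following sketch shows \b, \B, and ^ in action. It uses Python's re module purely for illustration (Python's \Z and \z handling differs from the table above, so those are left out), and the sample text is made up.

```python
import re

text = "id 123 and x123y\n456"

# \b on both sides: digits embedded in "x123y" are skipped.
bounded = re.findall(r"\b\d{3}\b", text)

# \B on both sides: only digits with word characters around them match.
embedded = re.findall(r"\B\d{3}\B", text)

# With the m flag, ^ matches at the start of every line, not just the text.
line_starts = re.findall(r"(?m)^\d{3}", text)

print(bounded)      # ['123', '456']
print(embedded)     # ['123']
print(line_starts)  # ['456']
```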

ASCII classes

Named classes that can be used in custom character classes, for example [[:ascii:]]. These only match ASCII characters.

Named Class    Description
alnum          Alphanumeric.
alpha          Alphabetic.
ascii          Any ASCII character.
blank          A space or tab.
cntrl          A control character.
digit          Any digit.
graph          Any graphical or printing character (not a space).
lower          Any lowercase letter.
print          Any printable character (including spaces).
punct          Any punctuation character.
space          A whitespace.
upper          Any uppercase letter.
word           The same as \w.
xdigit         Any hexadecimal digit.
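
Python's re module does not support the [[:name:]] syntax, so the sketch below spells out a few of the named classes as explicit ASCII ranges to show what they cover. The ASCII_CLASSES table and the expand helper are hypothetical names introduced only for this example.

```python
import re

# Hypothetical lookup table: explicit ASCII ranges for some named classes.
ASCII_CLASSES = {
    "alpha": "A-Za-z",
    "digit": "0-9",
    "alnum": "0-9A-Za-z",
    "xdigit": "0-9A-Fa-f",
    "blank": " \\t",
    "word": "0-9A-Za-z_",
}


def expand(name: str) -> str:
    """Return a bracket expression equivalent to [[:name:]]."""
    return "[" + ASCII_CLASSES[name] + "]"


hex_run = re.fullmatch(expand("xdigit") + "+", "c0ffee") is not None
blank_run = re.fullmatch(expand("blank") + "+", " \t ") is not None
# Named classes are ASCII-only, so the accented letter is rejected.
non_ascii = re.fullmatch(expand("alpha") + "+", "café") is not None

print(hex_run, blank_run, non_ascii)  # True True False
```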

Character escapes

Regex Syntax    Description
\xhh            Escapes characters with hex code hh (up to 2 digits are allowed).
\x{hhhhhh}      Escapes characters with hex code hhhhhh (between 1 and 6 digits).
\a              Escapes a bell (\x{7}).
\b              Escapes a backspace (\x{8}). This only works in a custom character class (for example, [\b]), otherwise it is treated as a word boundary.
\cx             Escapes a control sequence, where x is A-Z (upper or lowercase). For example: \cA = \x{0}, \cB = \x{1}, ..., \cZ = \x{19}.
\e              Escapes the ASCII escape character (\x{1B}).
\f              Escapes a form feed (\x{C}).
\n              Escapes a newline (\x{A}).
\r              Escapes a carriage return (\x{D}).
\t              Escapes a tab (\x{9}).
\v              Escapes a vertical tab (\x{B}).
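
A minimal sketch of a few of these escapes, again using Python's re module as a stand-in. Python accepts \xhh, \t, \n, and [\b], but not \x{...}, \cx, or \e, so only the shared forms appear here.

```python
import re

# \x41 and \x42 are the hex escapes for "A" and "B"; \t and \n are tab and newline.
pattern = re.compile(r"\x41\t\x42\n")
matched = pattern.fullmatch("A\tB\n") is not None

# Inside a custom character class, \b is a backspace (\x08), not a word boundary.
backspace = re.fullmatch(r"[\b]", "\x08") is not None

print(matched, backspace)  # True True
```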

Character classes

Regex Syntax    Description
.               Matches any character except \n. Enable the s flag to match any character, including \n.
\d              Matches any ASCII digit ([0-9]).
\D              Matches anything that does not match \d.
\h              Matches a space or tab ([\x{20}\t]).
\H              Matches anything that does not match \h.
\s              Matches any ASCII whitespace ([\r\n\t\x{C}\x{B}\x{20}]).
\S              Matches anything that does not match \s.
\v              Matches ASCII vertical space ([\x{B}\x{A}\x{C}\x{D}]).
\V              Matches anything that does not match \v.
\w              Matches any ASCII word character ([a-zA-Z0-9_]).
\W              Matches anything that does not match \w.
\p{x}           Matches anything that matches the Unicode property x. See Unicode Properties for a full list.
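
The ASCII-only behavior of \d and \w can be approximated in Python by passing re.ASCII (by default Python's versions also match non-ASCII digits and letters). This is a sketch with made-up sample text, not output from Sensitive Data Scanner.

```python
import re

# re.ASCII restricts \d and \w to ASCII, mirroring the table above.
digits = re.findall(r"\d+", "order 4821, ٤٨٢١", re.ASCII)
words = re.findall(r"\w+", "user_1 café", re.ASCII)

# . stops at \n unless the s flag is enabled.
without_s = re.findall(r"a.b", "a\nb a-b")
with_s = re.findall(r"(?s)a.b", "a\nb a-b")

print(digits)     # ['4821']
print(words)      # ['user_1', 'caf']
print(without_s)  # ['a-b']
print(with_s)     # ['a\nb', 'a-b']
```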

Custom character classes

Regex Syntax                      Description
[...]                             Matches any character listed inside the brackets.
[^...]                            Matches anything that is not listed inside the brackets.
[a-zA-Z]                          Matches any letter in the range A-Z (upper or lowercase).
[\s\w\d\S\W\D\v\V\h\H\p{x}...]    Other classes defined above are allowed (except . which is treated as a literal).
[[:ascii_class:]]                 Matches special named ASCII classes.
[[:^ascii_class:]]                Matches inverted ASCII classes.
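
For instance, ranges, negation, and the literal dot can be combined as in the sketch below (Python's re module, hypothetical sample strings).

```python
import re

# [0-9A-Fa-f] lists three ranges; {8} repeats the class eight times.
hex_id = re.compile(r"[0-9A-Fa-f]{8}")
hex_matches = hex_id.findall("ids: deadBEEF, 12345678, xyz12345")

# [^\s,]+ is a negated class: a run of anything except whitespace or a comma.
fields = re.findall(r"[^\s,]+", "a,b c\td")

# Inside brackets, . is a literal dot rather than "any character".
dots = re.findall(r"[.]", "a.b")

print(hex_matches)  # ['deadBEEF', '12345678']
print(fields)       # ['a', 'b', 'c', 'd']
print(dots)         # ['.']
```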

Groups

Use groups to change precedence or set flags. Since captures are not used in Sensitive Data Scanner, capturing groups behave like non-capturing groups. Similarly, capture group names are ignored.

Regex Syntax     Description
(...)            A capture group.
(?<name>...)     A named capture group.
(?P<name>...)    A named capture group.
(?'name'...)     A named capture group.
(?:...)          A non-capturing group.
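
The effect of a group on precedence can be seen in this short sketch, which uses Python's re module and hypothetical inputs.

```python
import re

# Without a group, + applies only to "b"; with a group it applies to "ab".
ungrouped = re.fullmatch(r"ab+", "ababab") is not None
grouped = re.fullmatch(r"(?:ab)+", "ababab") is not None

# A group limits the alternation to "a" or "e" instead of splitting the whole pattern.
spellings = [s for s in ["gray", "grey", "gr"] if re.fullmatch(r"gr(?:a|e)y", s)]

print(ungrouped, grouped)  # False True
print(spellings)           # ['gray', 'grey']
```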

Setting flags

Use flags to modify the regex behavior. There are two ways to specify flags:

  1. (?imsx:...): Set flags that only apply to the expression inside of a non-capturing group.
  2. (?imsx)...: Set flags that apply to the rest of the current group.

Flags listed after a - are removed if they were previously set.

Use (?-imsx) to unset the imsx flags.

Available flags

Flag    Name                Description
i       Case insensitive    Letters match both upper and lower case.
m       Multi-line mode     ^ and $ match the beginning and end of a line.
s       Single line         Allows . to match any character, including \n (by default, . matches anything except \n).
x       Extended            Whitespace is ignored (except in a custom character class).
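
The sketch below illustrates scoped flags and flag removal using Python's re module, which accepts the same (?imsx:...) and (?-i:...) forms. Python only allows the unscoped (?i) form at the very start of a pattern, so that is the only place it is used here. All patterns and inputs are hypothetical.

```python
import re

# (?i:...) applies the i flag only inside the group, so "_KEY" stays case-sensitive.
scoped = re.compile(r"(?i:api)_KEY")
scoped_hits = [s for s in ["API_KEY", "Api_KEY", "api_key"] if scoped.fullmatch(s)]

# (?-i:...) removes a flag that was set earlier in the pattern.
unset = re.compile(r"(?i)abc(?-i:DEF)")
unset_hits = [s for s in ["ABCDEF", "abcDEF", "ABCdef"] if unset.fullmatch(s)]

# (?s:...) lets . cross a newline inside the group.
with_s = re.search(r"BEGIN(?s:.*)END", "BEGIN\nsecret\nEND") is not None
without_s = re.search(r"BEGIN.*END", "BEGIN\nsecret\nEND") is not None

# (?x:...) ignores the whitespace used to space out the pattern.
extended = re.fullmatch(r"(?x: \d{3} - \d{4} )", "555-1234") is not None

print(scoped_hits)        # ['API_KEY', 'Api_KEY']
print(unset_hits)         # ['ABCDEF', 'abcDEF']
print(with_s, without_s)  # True False
print(extended)           # True
```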

Quoting

Use the regex syntax \Q...\E to treat everything between \Q and \E as a literal.
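
Python's re module has no \Q...\E, so the sketch below uses re.escape as a rough analog: it backslash-escapes every metacharacter so the text is matched literally. It is meant only to show why quoting matters, and the sample string is made up.

```python
import re

literal = "price (USD) = $9.99?"

# Escaped, the text is matched character for character (the role \Q...\E plays).
quoted = re.compile(re.escape(literal))
exact = quoted.search("note: price (USD) = $9.99? ok") is not None

# Unescaped, ( ) $ . ? are read as regex syntax, so the text does not match itself.
raw = re.search(literal, literal) is not None

print(exact, raw)  # True False
```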

Quantifiers

Quantifiers repeat the previous expression. A greedy quantifier takes as many repetitions as possible and gives them back only as needed to find a match. A lazy quantifier takes the minimum number of repetitions and adds more only as needed.

Regex Syntax    Description
?               Repeat 0 or 1 time (greedy).
??              Repeat 0 or 1 time (lazy).
+               Repeat 1 or more times (greedy).
+?              Repeat 1 or more times (lazy).
*               Repeat 0 or more times (greedy).
*?              Repeat 0 or more times (lazy).
{n}             Repeat exactly n times (the lazy modifier is accepted here but is ignored).
{n,m}           Repeat at least n times, but no more than m times (greedy).
{n,m}?          Repeat at least n times, but no more than m times (lazy).
{n,}            Repeat at least n times (greedy).
{n,}?           Repeat at least n times (lazy).

Note: {,m} is not valid and is treated as a literal. Similarly, any other deviation from the syntax above, such as adding spaces inside the braces, causes the quantifier to be treated as a literal.
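
The difference between greedy and lazy repetition is easiest to see side by side. The sketch below uses Python's re module with hypothetical input. It does not cover {,m}, which Python interprets differently from the note above.

```python
import re

text = "<a><b>"

# Greedy: .+ takes as much as it can, then backs off to the last ">".
greedy = re.search(r"<.+>", text).group()

# Lazy: .+? takes as little as it can, stopping at the first ">".
lazy = re.search(r"<.+?>", text).group()

# {n} and {n,m} bound the number of repetitions.
exact = re.findall(r"\b\d{4}\b", "12 1234 12345")
ranged = re.findall(r"\b\d{2,4}\b", "1 12 1234 12345")

print(greedy)  # <a><b>
print(lazy)    # <a>
print(exact)   # ['1234']
print(ranged)  # ['12', '1234']
```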

Unicode properties

The following Unicode properties can be used for x in the character class \p{x}.

Unicode Property    Description
C                   Other
Cc                  Control
Cf                  Format
Cn                  Unassigned
Co                  Private use
Cs                  Surrogate
L                   Letter
Ll                  Lowercase letter
Lm                  Modifier letter
Lo                  Other letter
Lt                  Title case letter
Lu                  Uppercase letter
M                   Mark
Mc                  Spacing mark
Me                  Enclosing mark
Mn                  Non-spacing mark
N                   Number
Nd                  Decimal number
Nl                  Letter number
No                  Other number
P                   Punctuation
Pc                  Connector punctuation
Pd                  Dash punctuation
Pe                  Close punctuation
Pf                  Final punctuation
Pi                  Initial punctuation
Po                  Other punctuation
Ps                  Open punctuation
S                   Symbol
Sc                  Currency symbol
Sk                  Modifier symbol
Sm                  Mathematical symbol
So                  Other symbol
Z                   Separator
Zl                  Line separator
Zp                  Paragraph separator
Zs                  Space separator
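
These codes are the standard Unicode general categories. Python's built-in re module does not support \p{x}, so the sketch below uses the standard unicodedata module to look up the category of a few characters instead. The has_property helper is a hypothetical name that approximates what \p{x} would test, relying on the fact that a one-letter property covers every two-letter property that starts with it.

```python
import unicodedata

# Look up the general category of a few sample characters.
categories = {ch: unicodedata.category(ch) for ch in "Aa5$_- "}


def has_property(ch: str, prop: str) -> bool:
    """Approximate \\p{prop}: True if ch's general category is prop or falls under it."""
    return unicodedata.category(ch).startswith(prop)


print(categories)
print(has_property("é", "L"))   # True: "é" is Ll, which falls under L
print(has_property("5", "L"))   # False: "5" is Nd
print(has_property("A", "Lu"))  # True
```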

Script names can be used to match any character from the script. The following are allowed:

Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Albanian, Chakma, Cham, Cherokee, Chorasmian, Common, Coptic, Cuneiform, Cypriot, Cypro_Minoan, Cyrillic, Deseret, Devanagari, Dives_Akuru, Dogra, Duployan, Egyptian_Hieroglyphs, Elbasan, Elymaic, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek, Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul, Hanifi_Rohingya, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khitan_Small_Script, Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Mahajani, Makasar, Malayalam, Mandaic, Manichaean, Marchen, Masaram_Gondi, Medefaidrin, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Multani, Myanmar, Nabataean, Nandinagari, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sogdian, Old_South_Arabian, Old_Turkic, Old_Uyghur, Oriya, Osage, Osmanya, Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Tangsa, Tangut, Telugu, Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Toto, Ugaritic, Vai, Vithkuqi, Wancho, Warang_Citi, Yezidi, Yi, Zanabazar_Square.
