Wilson Mar

Don’t avoid them. Don’t fear them. Make them your friend. Here and now.

Overview

JOKE: ‘Some people, when confronted with a problem, think, “I know, I’ll use regular expressions.” Now they have two problems.’ - J. Zawinski

NOTE: Content here is my personal opinion, and not intended to represent any employer (past or present). “PROTIP:” here highlights information I haven’t seen elsewhere on the internet because it is hard-won, little-known but significant facts based on my personal research and experience.

Why Do We Care About Regex?

The term regular expression is often abbreviated as “regex” or “regexes” in plural.

A regular expression is a “formula” for matching strings that follow some pattern in order to operate on a subject character string.

Text in HTML, log files, text files containing data, etc. are parsed in order to validate for correct formatting, to extract substrings, or to replace content.

RegEx parsing is used by code scanners to identify patterns of coding that may be vulnerable to hacking (see OWASP).

The Perl ("Practical Extraction and Report Language") language became popular partly because of its extensive support for regular expressions. Perl allows embedding of regular expressions in file tests, control loops, output formats, etc.

Different RegEx Engine Flavors

Beware that vendors’ competitive urges have resulted in several versions of regular expressions:

  1. regex101.com lists the different “flavors” of RegEx engines:

    • PCRE v2

    • PCRE v1 is what Splunk uses (at time of writing).

    • ECMAScript (used in JavaScript)

    • Golang

    • .NET (C#)

    • Rust

    • The historical Simple Basic Regular Expression (BRE) notation, described as part of the regexp() function in the XSH specification, which provides backward compatibility, but which may be withdrawn from a future specification set.

    • The GNU operating system’s regex package is available using ftp from ftp.gnu.org.

    • Perl, Python, Emacs, Tcl, and .NET use a backtracking regular expression matcher that incorporates a traditional Nondeterministic Finite Automaton (NFA) engine. A standardized POSIX NFA engine is slower still, because it keeps backtracking until it is sure it has found the longest possible match.

    • Utility programs initially developed for Unix – awk, egrep, and lex – use a faster, but more limited, pure regular expression Deterministic Finite Automaton (DFA) engine.

    • The Extended Regular Expressions (ERE) version complies with the internationalized ISO/IEC 9945-2:1993 standard. It matches based on the bit pattern used for encoding the character, not on the graphic representation of the character (which may be represented by more than one bit pattern).

    • Microsoft’s .NET Framework regular expressions are said to be compatible with Perl 5 regular expressions, but include features not yet seen in other implementations, such as right-to-left matching and on-the-fly compilation.

    NOTE: Parsing C/C++ style comments is a little more complex when you have to take into account string embedding, escaping, and line continuation. For example, the match routine of the C language library accepts strings that are interpreted as regular expressions.

Try It Now

TIP: The easiest way to learn this is to take a hands-on approach and manually work through some patterns.

Test and debug regular expressions using these tools:

TOOL: Download or clone RegexExplained and see it used by its author @LeaVerou in the VIDEO /Reg(exp){2}lained/: Demystifying Regular Expressions, presented live at the O’Reilly Fluent conference in May 2012.

TOOL: RegexPal.com parses JavaScript on a web page.

TOOL: Use the Regex Coach to graphically experiment with (Perl-compatible) regular expressions interactively. Dr. Edmund Weitz wrote this for use on Windows and Linux systems to show how Common Lisp can be practical using the LispWorks IDE and cross-platform CAPI toolkit.

TOOL: Regular Expression Tester parses within ASP.NET.

TOOL: $40 RegExBuddy is a Windows program.

Regex Patterns

Instead of custom-written coding (looping through each line and invoking sub-string functions), regex methods take a pattern of characters that controls how they search and match.

This video shows how files containing different date formats can’t be parsed using just the sub-string function alone, which is a dangerously blunt tool.

Patterns comprise two basic character types available from a standard keyboard (not using Greek alphas, lambdas, etc. like mathematicians do):

  • Literal (normal) text characters such as 0 thru 9 or a thru z; and
  • Metacharacters, which specify filtering, enabling a powerful, flexible, and efficient method for processing text. However, their compactness makes them easier to create than to read.
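
To make the two character types concrete, here is a minimal sketch in Python (chosen only for illustration; the function name find_dates and the sample sentence are hypothetical). The slashes are literal characters, while \d and the {m,n} counts are metacharacters, which lets one pattern pull out dates of varying widths that a fixed-position sub-string call would miss.

```python
import re

# Literal "/" characters match themselves; \d, {1,2} and {2,4} are metacharacters.
DATE_PATTERN = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")

def find_dates(text):
    """Return every substring of text that looks like m/d/yy or mm/dd/yyyy."""
    return DATE_PATTERN.findall(text)

sample = "Shipped 4/5/91, returned 04/15/1991, resold 12/1/2001."
print(find_dates(sample))  # ['4/5/91', '04/15/1991', '12/1/2001']
```

The pattern only checks shape, not calendar validity, so a string such as 99/99/99 would also be returned.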

JOKE: Some call regex expressions “ASCII puke” because they look like a jumble of letters and numbers.

The Kleene Star * (Wild Card) Metacharacter

VIDEO: The development of regular expressions is traced back to work during the 1950s by Kleene (some pronounce it like “clean knee”, not “clean”) – Stephen Cole Kleene (1909-1994), an American mathematician and theoretical computer scientist at Princeton and U. of Wisconsin-Madison.

For this reason, the “*” wildcard character used in computer searches is formally known as a "Kleene star."

The set of all strings formed by repeating a pattern zero or more times is formally known as its "Kleene closure".

Text-manipulation tools on the Unix platform, including the ed and vi text editors and the grep file search utility, made use of Kleene’s notation for “the algebra of regular sets.”

Basic Metacharacters

There are 12 of them.

Each entry below lists the metacharacter, its operator name, what it matches, and an example regular expression.

Metacharacter: .
Operator name: period
Matches: any single character except NUL.
Example: r.t matches the strings rat, rut, and r t, but not root (two o's) nor the Rot in Rotten (upper case R).

Metacharacter: *
Operator name: Kleene star, asterisk, wildcard
Matches: zero or more occurrences of the character immediately preceding.
Example: .* means match any number of any characters.

Metacharacter: $
Operator name: dollar (currency) anchor
Matches: the end of a line.
Example: weasel$ matches the end of the string "He's a weasel" but not the string "They are a bunch of weasels." When a $ that is the last operator of a regular expression, or that immediately follows a right parenthesis, is meant as a literal dollar sign, it must be preceded by a backslash \.

Metacharacter: ^
Operator name: circumflex or caret anchor
Matches: the beginning of a string/line.
Example: ^When in matches the beginning of the string "When in the course of human events" but does not match "What and When in the". When a ^ that is the first operator of a regular expression, or the first character inside brackets, is meant as a literal caret, it must be preceded by a backslash.

Metacharacter: [ ], [c1-c2], [^c1-c2]
Operator name: square brackets
Matches: any one of the characters between the brackets.
Example: r[aou]t matches rat, rot, and rut, but not ret. Ranges of characters can be specified by using a hyphen: [0-9] means match any digit. Multiple ranges can be specified as well: [A-Za-z] means match any upper or lower case letter. To match any character except those in the range (the complement range), use the caret as the first character after the opening bracket: [^269A-Z] matches any character except 2, 6, 9, and upper case letters.

Metacharacter: \
Operator name: backslash
Matches: this is the quoting character; use it to treat the following character as an ordinary character.
Example: \$ matches the dollar sign character ($) rather than the end of a line. Similarly, \. matches the period character rather than any single character. Operators inside brackets do not need to be preceded by a backslash.

Metacharacter: \< \>
Operator name: backslash and angle bracket
Matches: the beginning (\<) or end (\>) of a word.
Example: \<the matches "the" in the string "for the wise" but does not match "the" in "otherwise". NOTE: this metacharacter is not supported by all applications.

Metacharacter: \( \)
Operator name: backslash and parentheses
Matches: the expression between \( and \) as a group. It also saves the characters matched by the expression into temporary holding areas.
Example: up to nine pattern matches can be saved in a single regular expression. They can be referenced as \1 through \9.

Metacharacter: |
Operator name: pipe (alternation)
Matches: either of two conditions (a logical Or).
Example: (him|her) matches the line "it belongs to him" and the line "it belongs to her" but does not match the line "it belongs to them." NOTE: this metacharacter is not supported by all applications.

Metacharacter: +
Operator name: plus sign
Matches: one or more occurrences of the character or regular expression immediately preceding.
Example: 9+ matches 9, 99, or 999. NOTE: this metacharacter is not supported by all applications.

Metacharacter: \{i\}, \{i,j\}
Operator name: braces
Matches: a specific number of instances, or a number of instances within a range, of the preceding character.
Example: A[0-9]\{3\} matches "A" followed by exactly 3 digits. That is, it matches A123 but not A1234. [0-9]\{4,6\} matches any sequence of 4, 5, or 6 digits. NOTE: this metacharacter is supported by Robot's C-VU language but not by all applications.

Metacharacter: ?
Operator name: question mark
Matches: 0 or 1 occurrence of the character or regular expression immediately preceding.
Example: ? is equivalent to {0,1}. NOTE: this metacharacter is not supported by all applications. Question marks are also used to specify non-greedy quantifiers. For example, "/A[A-Z]*?B/" means "match an A, followed by only as many capital letters as are needed to find a B."
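
As a hands-on check of the entries above, here is a short sketch using Python's re module, which is an assumption of this example rather than the VU engine the table describes. Python writes groups and counts without backslashes and spells word boundaries as \b instead of \< and \>, so the patterns are adapted accordingly; the helper name matches is hypothetical.

```python
import re

def matches(pattern, text):
    """Return True when pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

# . matches any single character (Python's dot excludes newline by default).
print(matches(r"r.t", "rat"), matches(r"r.t", "root"))       # True False
# Anchors: $ ties the match to the end, ^ to the beginning.
print(matches(r"weasel$", "He's a weasel"))                   # True
print(matches(r"weasel$", "They are a bunch of weasels."))    # False
print(matches(r"^When in", "What and When in the"))           # False
# Character classes and complemented classes.
print(matches(r"r[aou]t", "ret"))                             # False
print(matches(r"^[^269A-Z]+$", "abc13"))                      # True
# Word boundary: "the" as a word start, but not inside "otherwise".
print(matches(r"\bthe", "for the wise"), matches(r"\bthe", "otherwise"))  # True False
# Greedy versus non-greedy quantifiers on the same subject string.
print(re.search(r"A[A-Z]*B", "AXBYB").group())                # AXBYB
print(re.search(r"A[A-Z]*?B", "AXBYB").group())               # AXB
```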

In addition, VU regular expressions can include ASCII control characters in the range 0 to 7F hex (0 to 127 decimal).

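PROTIP: The VU regular expressions described here process only the ASCII character set and do not process Unicode (UTF-8). Other engines differ, so check the Unicode support of the flavor you use.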

Backward Slash Extended MetaCharacters

One of the ways people get confused by regular expressions is the use of the backslash character.

For an analogy that you may already know, in Windows command line terminals, people use dir *.txt /s to look for text files in subdirectories. The asterisk or star character is a wildcard. The /s specifies processing of sub-folders.

With regex, the same file-name match would be specified by .*\.txt, with a backslash in front of the second dot as the escape character, since an unescaped dot has another meaning within regex expressions.

The dot character . is used in regex to represent any one character.

Backreferences provide a convenient way to find repeating groups of characters. They can be thought of as a shorthand instruction to match the same string again.
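
A common use of backreferences is spotting a word typed twice in a row. The sketch below is in Python (an assumption for illustration; the function name find_doubled_words is hypothetical). The \1 in the pattern means "match again whatever the first parenthesized group just matched."

```python
import re

# (\w+) captures a word; \1 then requires the very same text to appear again.
DOUBLED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

def find_doubled_words(text):
    """Return each word that appears twice in a row, such as "the the"."""
    return [m.group(1) for m in DOUBLED_WORD.finditer(text)]

print(find_doubled_words("Paris in the the spring is is lovely"))  # ['the', 'is']
print(find_doubled_words("the theme"))                             # []
```

The trailing \b keeps "the theme" from being reported, because the second "the" there is only the start of a longer word.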

Extended

Like C and Java programs, regex programs use \ as an escape character to denote use of special characters as plain text. These additional escape tags are recognized within Ruby regex:

\A Beginning of a string
\b Word boundary
\B Non-word boundary
\d Digit, same as [0-9]
\D Non-digit
\s Whitespace [ \t\r\n\f\v]
\S Non-Whitespace
\w Word character
\W Non-Word character
\z End of a string
\Z End of string, before a trailing newline (if any)
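
Most of these escapes carry over to other engines, with small differences. The sketch below uses Python rather than Ruby (an assumption for illustration; the sample text is hypothetical). One difference to watch: in Python, \Z means the absolute end of the string, and it is $ that tolerates a trailing newline.

```python
import re

text = "Order 66 shipped\n"

print(re.findall(r"\d", text))                # ['6', '6']
print(re.findall(r"\w+", text))               # ['Order', '66', 'shipped']
print(re.findall(r"\s", text))                # [' ', ' ', '\n']
print(bool(re.search(r"\AOrder", text)))      # True: \A anchors at the start
print(bool(re.search(r"\bship", text)))       # True: "ship" begins a word
print(bool(re.search(r"\Bship", text)))       # False: no "ship" inside a word
print(bool(re.search(r"shipped\Z", text)))    # False: a newline still follows
print(bool(re.search(r"shipped$", text)))     # True: $ allows one trailing newline
```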

[10:00] To specify digits (numbers) [0-9]:

\d
   

[10:48] To specify letters, numbers, and underscore, use shortcut:

\w
   

[14:34] To match CSS color specifications written as 3 or 6 hex digits, such as #abc, #f00, #BADA55, #C0FE56:

/^#([a-f\d]{3}){1,2}$/i.test(str);
   

The group [a-f\d]{3} matches exactly three characters that are each a letter a-f or a digit, and {1,2} repeats that group once or twice. The i flag makes the match case-insensitive.
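
The same check can be sketched in Python (an assumption for illustration; the function name is_hex_color is hypothetical). The re.ASCII flag is added so that \d means only 0-9, which is what JavaScript's \d means.

```python
import re

# One or two groups of exactly three hex digits after the leading "#".
HEX_COLOR = re.compile(r"^#([a-f\d]{3}){1,2}$", re.IGNORECASE | re.ASCII)

def is_hex_color(value):
    """Return True for 3- or 6-digit CSS hex colors such as #abc or #BADA55."""
    return HEX_COLOR.match(value) is not None

for candidate in ["#abc", "#f00", "#BADA55", "#C0FE56", "#abcd", "#ggg", "abc"]:
    print(candidate, is_hex_color(candidate))
```

Four or five digits are rejected because the three-digit group has to repeat a whole number of times.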

Double Backslash Regex in LoadRunner

The double backslash is required in C language programs invoking regex because both C and regex “consume” a backslash as an escape character.

LoadRunner has this function which creates a parameter named “selected_value”:

char *str = " ... the html text here ...";
 
web_reg_save_param_regexp(
   "ParamName=selected_value",
   "RegExp=<select name=\"Regulatory Code_0\"[\\s\\S]*?<option .*? selected>(.*?)</option>",
   LAST );
   

The [\s\S] means match any whitespace or any non-whitespace character, that is, any character including newlines (used because no Perl-like “s” modifier is available).
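
The effect of [\s\S] is easy to see outside LoadRunner. The sketch below uses Python, not the LoadRunner API, and the sample HTML and variable names are hypothetical. It shows that a plain dot stops at line breaks while [\s\S] does not, and that escaping each backslash (as C requires) yields the same pattern text as a Python raw string.

```python
import re

html = (
    '<select name="code">\n'
    '<option value="1">One</option>\n'
    '<option value="2" selected>Two</option>\n'
    '</select>'
)

dot_pattern = r'<select name="code".*?<option .*? selected>(.*?)</option>'
any_pattern = r'<select name="code"[\s\S]*?<option .*? selected>(.*?)</option>'

# The dot cannot cross the newline after the <select> tag, so nothing matches.
print(re.search(dot_pattern, html))                        # None
# [\s\S] matches newlines too, so the selected option is captured.
print(re.search(any_pattern, html).group(1))               # Two
# Python's re.DOTALL plays the role of Perl's "s" modifier.
print(re.search(dot_pattern, html, re.DOTALL).group(1))    # Two
# Doubling the backslashes produces the same pattern as a raw string.
print("[\\s\\S]" == r"[\s\S]")                             # True
```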

Introduced with VuGen 12 is a new function:

char *str = " ... the html text here ...";
 
lr_save_param_regexp(str, strlen(str),
   "RegExp=... the regex here ...",
   "ResultParam=selected_value",
   LAST);
   

Examples of Regular Expressions

This regular expression matches any day of the week:

((Mon)|(Tues)|(Wednes)|(Thurs)|(Fri)|(Satur)|(Sun))day

This matches simple dates with 1 or 2 digits for the month, 1 or 2 digits for the day, and either 2 or 4 digits for the year.
Matches: [4/5/91], [04/5/1991], [4/05/89]
Non-Matches: [4/5/1]

((\d{2})|(\d))\/((\d{2})|(\d))\/((\d{4})|(\d{2}))
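
To try the listed matches and non-matches, here is a small Python sketch (an assumption for illustration; the names are hypothetical). It uses fullmatch so the whole string has to fit the pattern, and compares the result with a shorter pattern that expresses the same digit counts.

```python
import re

# The pattern as given above, and a compact form with the same digit counts.
SIMPLE_DATE = re.compile(r"((\d{2})|(\d))\/((\d{2})|(\d))\/((\d{4})|(\d{2}))")
COMPACT_DATE = re.compile(r"\d{1,2}/\d{1,2}/(\d{4}|\d{2})")

def is_simple_date(text):
    """Return True when the whole string looks like m/d/yy or mm/dd/yyyy."""
    return SIMPLE_DATE.fullmatch(text) is not None

for candidate in ["4/5/91", "04/5/1991", "4/05/89", "4/5/1"]:
    same = is_simple_date(candidate) == (COMPACT_DATE.fullmatch(candidate) is not None)
    print(candidate, is_simple_date(candidate), same)
```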

This matches valid 24-hour times in the format hh:mm (so a time that fails to match is incorrect):

/((?:0?[0-9]|1[0-9]|2[0-3]):[0-5][0-9])/

Validate a number between 1 and 255, such as an IP octet:

^([1-9]|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])$
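
A range pattern like this is easy to get subtly wrong, so it is worth testing against every candidate value. The Python sketch below does that (an assumption for illustration; the function name is_octet is hypothetical).

```python
import re

# Alternatives cover 1-9, 10-99, 100-199, 200-249 and 250-255 in turn.
OCTET = re.compile(r"^([1-9]|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])$")

def is_octet(text):
    """Return True when text is a decimal number from 1 to 255 without leading zeros."""
    return OCTET.match(text) is not None

# Compare the regex against plain arithmetic for 0 through 999.
print(all(is_octet(str(n)) == (1 <= n <= 255) for n in range(1000)))  # True
print(is_octet("01"))                                                  # False
```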

This breaks down a Uniform Resource Identifier (URI) into its component parts. (from ActiveState quoting Appendix B of IETF RFC 2396)

my $uri = "http://www.ics.uci.edu/pub/ietf/uri/#Related";
print "$1, $2, $3, $4, $5, $6, $7, $8, $9" if
  $uri =~ m{^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?};
   

$1 = http:
$2 = http (the scheme)
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu (the authority)
$5 = /pub/ietf/uri/ (the path)
$6 = (empty)
$7 = (empty; the query)
$8 = #Related
$9 = Related (the fragment)
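
The same expression works unchanged in other engines. Here is a Python sketch of it (an assumption for illustration; the function name split_uri and the dictionary keys are hypothetical). Groups that did not take part in the match come back as None in Python.

```python
import re

# The RFC 2396 Appendix B expression; groups 2, 4, 5, 7 and 9 hold the parts.
URI = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Break a URI into scheme, authority, path, query and fragment."""
    m = URI.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
```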

Validate an e-mail address whose domain is either a host name or a bracketed IP address in the form [255.255.255.255]. Of course, the best way to test an email address is to send e-mail to it:

^([a-zA-Z0-9_\-])+(\.([a-zA-Z0-9_\-])+)*@((\[(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5]))\]))|((([a-zA-Z0-9])+(([\-])+([a-zA-Z0-9])+)*\.)+([a-zA-Z])+(([\-])+([a-zA-Z0-9])+)*))$

Validates dates in the US m/d/y format from 1/1/1600 to 12/31/9999. The days are validated for the given month and year. Leap years are validated for all 4-digit years from 1600-9999, and all 2-digit years except 00 since it could be any century (1900, 2000, 2100). Days and months must be 1 or 2 digits and may have leading zeros. Years must be 2 or 4 digits. 4-digit years must be between 1600 and 9999. The date separator may be a slash (/), dash (-), or period (.).

^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$

Validate passwords to be at least 4 characters, no more than 8 characters, and must include at least one upper case letter, one lower case letter, and one numeric digit.

^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{4,8}$
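
Each (?=...) in this pattern is a lookahead: it checks that something appears later in the string without consuming any characters, so the three requirements can be satisfied in any order before .{4,8} enforces the length. Here is a Python sketch of the check (an assumption for illustration; the function name is hypothetical).

```python
import re

# Three lookaheads (digit, lower case, upper case), then the length limit.
PASSWORD = re.compile(r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{4,8}$")

def is_valid_password(text):
    """Return True for 4-8 characters with a digit, a lower and an upper case letter."""
    return PASSWORD.match(text) is not None

print(is_valid_password("Ab1x"))       # True
print(is_valid_password("abc1"))       # False: no upper case letter
print(is_valid_password("ABCDEFG1"))   # False: no lower case letter
print(is_valid_password("Abcdefg12"))  # False: nine characters
```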

Validate major credit card numbers from Visa (length 16, prefix 4), Mastercard (length 16, prefix 51-55), Discover (length 16, prefix 6011), American Express (length 15, prefix 34 or 37). All 16 digit formats accept optional hyphens (-) between each group of four digits.

^(?:((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[47]\d{13})$
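
Alternation has the lowest precedence in a regular expression, so a top-level | splits a pattern into branches that each keep only the anchor written next to them. The Python sketch below (an assumption for illustration; the names and sample numbers are hypothetical) compares an ungrouped form with a form that wraps both branches in one group so that ^ and $ apply to each.

```python
import re

# Ungrouped: ^ belongs only to the 16-digit branch, $ only to the 15-digit one.
UNGROUPED = re.compile(r"^(4\d{3}|5[1-5]\d{2}|6011)-?\d{4}-?\d{4}-?\d{4}|3[47]\d{13}$")
# Grouped: both branches sit between ^ and $.
GROUPED = re.compile(r"^(?:(?:4\d{3}|5[1-5]\d{2}|6011)-?\d{4}-?\d{4}-?\d{4}|3[47]\d{13})$")

def looks_like_card_number(text):
    """Return True when text has the prefix and length of a supported card type."""
    return GROUPED.match(text) is not None

print(looks_like_card_number("4111-1111-1111-1111"))           # True
print(looks_like_card_number("378282246310005"))               # True
print(looks_like_card_number("4111111111111111-extra"))        # False
print(UNGROUPED.match("4111111111111111-extra") is not None)   # True: trailing junk accepted
```

This only checks prefix, length, and hyphen placement; it says nothing about whether the number was ever issued.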

This extended grep pattern matches a valid MAC address, such as [01:23:45:67:89:ab], [01:23:45:67:89:AB], [fE:dC:bA:98:76:54], with colons separating octets. It rejects strings that are too short or too long, or that contain invalid characters, such as [01:23:45:67:89:ab:cd], [01:23:45:67:89:Az], [01:23:45:56:]. It accepts mixed-case hexadecimal.

^([0-9a-fA-F][0-9a-fA-F]:){5}([0-9a-fA-F][0-9a-fA-F])$
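
The listed examples can be checked directly. Here is a Python sketch (an assumption for illustration, since the text above mentions extended grep; the function name is hypothetical) that runs the same pattern over the accepted and rejected samples.

```python
import re

# Five "hh:" groups followed by one final pair of hex digits.
MAC = re.compile(r"^([0-9a-fA-F][0-9a-fA-F]:){5}([0-9a-fA-F][0-9a-fA-F])$")

def is_mac_address(text):
    """Return True for six colon-separated pairs of hex digits."""
    return MAC.match(text) is not None

accepted = ["01:23:45:67:89:ab", "01:23:45:67:89:AB", "fE:dC:bA:98:76:54"]
rejected = ["01:23:45:67:89:ab:cd", "01:23:45:67:89:Az", "01:23:45:56:"]
print([is_mac_address(s) for s in accepted])  # [True, True, True]
print([is_mac_address(s) for s in rejected])  # [False, False, False]
```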

This matches the name of any state in the United States:

[ACF-IK-PR-W][a-y]{2,4}[a-y][CDIJMVY]?[a-z]{0,7}

But you probably use a drop-down list rather than making people type them out.

This Perl script (from Craig Berry) uses a pattern to validate British Royal Mail postcodes used in the UK. Each code has 2 parts; the inward (second) part cannot contain any character in “CIKMOV.” Note that the script checks only the letter/digit layout, not that restriction.

use strict;
my @patterns = ('AN NAA', 'ANN NAA', 'AAN NAA', 'AANN NAA',
                'ANA NAA', 'AANA NAA', 'AAA NAA');
foreach (@patterns) {
  s/A/[A-Z]/g;
  s/N/\\d/g;
  s/ /\\s?/g;
}
my $re = join '|', @patterns;
while (<>) {
  print /^(?:$re)$/o ? "valid\n" : "invalid\n";
}
   

Alternately, the RegEx:

(AB|B|BA|BB|BD|BH|BL|BN|BR|BS|BT|CA|CB|CF|CH|CM|CO|CR|CT|CV|CW|DA|DE|DG|DH|DL|DN|DT|DY|E|EC|EH|EN|EX|FK|FY|G|GL|GU|H|HG|HP|HR|HS|HU|HX|IG|IM|IP|IV|KA|KT|KW|KY|L|LA|LD|LE|LL|LN|LS|LU|M|ME|MK|MK|N|NE|NG|NN|NP|NR|NW|OL|OX|PA|PE|PH|PL|PO|PR|RG|RH|RM|S|SA|SE|SG|SK|SL|SM|SN|SO|SP|SR|SS|ST|SW|SY|TA|TD|TF|TN|TQ|TR|TS|TW|UB|W|WA|WC|WD|WF|WN|WR|WS|WV|Y|ZE)([1-9]|[1-9][0-9]) [1-9][A-Z]{2}

The RegEx for verifying Canadian postal codes, which alternate letter-digit-letter, an optional space, then digit-letter-digit (for example K1A 0B1):

^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ][ ]?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$

This matches any hexadecimal number of 1 to 4 digits, which covers decimal values in the range 0 to 65535:

[a-fA-F0-9]{1,4}

STAR: Visibone’s FREE Regular Expressions detailed cheatsheet provides examples for JavaScript.

$30 regexbuddy allows you to easily create, understand and test regex patterns for C# and VB.NET. It includes a library of expressions.

TOOL: Altova.com’s XML editor can edit regular expressions for XML Schema.

BOOK: Regular Expression Recipes: A Problem-Solution Approach (APress ) by Nathan A. Good

Error Recovery with Regular Expressions

If a VU regular expression contains an error, when you run a suite, TestManager writes the message to stderr output prefixed with the following header:

sqa7vui#xxx: fatal orig type error: tname: sname, line lineno

where:

  • #xxx identifies the user ID (not present if 0);
  • fatal signifies that error recovery is not possible (otherwise not present);
  • orig specifies the error origination (user, system, server, or program); and
  • type specifies the general error category (initialization, argument parsing, script initialization, or runtime).

If the error occurred during execution of a script (run-time category):

  • tname specifies the name of the script being executed when the error occurred;
  • sname specifies the name of the VU source file that contains the VU statement causing the error; and
  • lineno specifies the line number of this VU statement in the source file.

Note that the source file information will not be available if the script’s source cross-reference section has been stripped.

If a run-time error occurs due to an improper regular expression pattern in the match library function, a diagnostic message of the following form follows the header:

Regular Expression Error = errno

where errno is an error code which indicates the type of regular expression error. The following table lists the possible errno values and explains each.

errno Explanation

2 Illegal assignment form. Character after )$ must be a digit. Example: “([0-9]+)$x”

3 Illegal character inside braces. Expecting a digit. Example: “x{1,z}”

11 Exceeded maximum allowable assignments. Only $0 through $9 are valid. Example: “([0-9]+)$10”

30 Missing operand to a range operator (? {m,n} + *). Example: “?a”

31 Range operators (? {m,n} + *) must not immediately follow a left parenthesis. Example: “(?b)”

32 Two consecutive range operators (? {m,n} + *) are not allowed. Example: “[0-9]+?”

34 Range operators (? {m,n} + *) must not immediately follow an assignment operation. Example: “([0-9]+)$0{1-4}” Correction: “(([0-9]+)$0){1-4}”

36 Range level exceeds 254. Example: “[0-9]{1-255}”

39 Range nesting depth exceeded maximum of 18 during matching of subject string.

41 Pattern must have non-zero length. Example: “”

42 Call nesting depth exceeded 80 during matching of subject string.

44 Extra comma not allowed within braces. Example: “[0-9]{3,4,}”

46 Lower range parameter exceeds upper range parameter. Example: “[0-9]{4,3}”

49 ‘\0’ not allowed within brackets, or missing right bracket. Example: “[\0]” or “[0-9”

55 Parenthesis nesting depth exceeds maximum of 18. Example: “(((((((((((((((((((x)))))))))))))))))))”

56 Unbalanced parentheses. More right parentheses than left parentheses. Example: “([0-9]+)$1)”

57 Program error. Please report.

70 Program error. Please report.

90 Unbalanced parentheses. More left parentheses than right parentheses. Example: “(([0-9]+)$1”

91 Program error. Please report.

100 Program error. Please report.

C# Coding Example

The C# language provides the System.Text.RegularExpressions namespace:

   using System.Text.RegularExpressions;
   

This provides the Regex constructor, which instantiates an object of the Regex class:

      var regex = new Regex( pattern );
   

Use the Match method defined within Regex on the subject text to generate a match object:

      var match = regex.Match( subject );
   

See what came back:

      Console.WriteLine( match.Success );
   

This code would go inside code to define a command-line program named MatchTest.exe:

CREDITS:

References

http://www.wikiwand.com/en/Regular_expression

https://dev.to/emmabostian/regex-cheat-sheet-2j2a

Play RegexCrossword.com at various levels of difficulty. Fun until you get stuck.

Regex101.com (online regex tester and reference)

RegexOne.com (online tutorials with interactive labs)

regular-expressions.info (reference guide)

regexr.com (RegEx tester)

PROTIP: When using regex within Splunk or another time-series database, also specify a time range to narrow search results, which also speeds up searches. Splunk’s sidebar shows fields that it automatically extracted from events.