PERL Regex Scratch Pad
Class, alternation etc.
[--] -> Character class
(--) -> Grouping, capturing, constraining alternation.
| -> Alternation. Think of this as a set of rules to apply to a string. The rules are "OR"ed, i.e. only one of the rules needs to pass. Perl has ordered alternation, i.e. the alternatives are tried left to right in the order they are written, and the first one that lets the match succeed is used.
\1,\2 -> Backreferences to the values captured by (--), for use inside the search section (in the replacement section and in later code use $1, $2 instead)
(?:..) -> Non-Capturing parentheses
\f -> ASCII form feed
\b -> Word boundary (it means backspace only inside a character class, e.g. [\b])
/i -> Case-insensitive modifier
/g -> Global modifier
\w -> This metacharacter matches a-z A-Z 0-9 _
/x -> Free-form expression modifier. It lets the pattern be spread over several lines: most whitespace in the pattern is ignored, and anything from # to the end of the line is treated as a comment instead of being matched
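A few of the constructs above in action. This is only a minimal sketch; the sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Ordered alternation: the first alternative that allows a match wins,
# even though a later alternative could match more text.
my ($first) = "foobar" =~ /(foo|foobar)/;

# Backreference: \1 must repeat exactly what group 1 captured.
my ($dup) = "this is is a test" =~ /\b(\w+)\s+\1\b/;

# Non-capturing group plus /x: (?:a|e) groups without creating $1,
# so the only capture is the word after "gray"/"grey".
my ($animal) = "grey cat" =~ / gr (?:a|e) y \s+ (\w+) /x;

print "alternation: $first\n";   # foo
print "doubled word: $dup\n";    # is
print "animal: $animal\n";       # cat
```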
Greedy quantifiers
* -> Matches 0 or more times, as many as possible. Always succeeds, because matching zero times still counts as a match
? -> Matches 0 or 1 time, preferring 1
+ -> Matches 1 or more times, as many as possible
Lazy quantifiers
*? -> Matches 0 or more times, as few as possible. Always succeeds, because matching zero times still counts as a match
?? -> Matches 0 or 1 time, preferring 0
+? -> Matches 1 or more times, as few as possible
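Possessive quantifiers
*+ -> Matches 0 or more times, as many as possible, and never gives any back. Always succeeds on its own, because matching zero times still counts as a match
?+ -> Matches 0 or 1 time and never gives it back
++ -> Matches 1 or more times, as many as possible, and never gives any back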
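The three quantifier families can be compared on one string. A small sketch, with an invented sample string:

```perl
use strict;
use warnings;

my $str = 'say "hi" and "bye"';

# Greedy: .+ grabs as much as it can, then backs off to the LAST quote.
my ($greedy) = $str =~ /"(.+)"/;

# Lazy: .+? grabs as little as it can, stopping at the NEXT quote.
my ($lazy) = $str =~ /"(.+?)"/;

# Possessive: .++ grabs everything and never gives anything back,
# so the closing quote can never be matched.
my $possessive_ok = ($str =~ /".++"/) ? 1 : 0;

print "greedy:     $greedy\n";          # hi" and "bye
print "lazy:       $lazy\n";            # hi
print "possessive: $possessive_ok\n";   # 0
```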
Anchors
^ $ -> Match the start and end positions of a string, not any character
By default ^ and $ match only at the start and end of the string and ignore any embedded end-of-line characters, except that $ also matches just before a newline that ends the string.
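A quick check of the default anchor behaviour, as a sketch with an invented two-line string:

```perl
use strict;
use warnings;

my $text = "one\ntwo";

# Without /m, ^ only matches at the very start of the string,
# so "two" on the second line is not found by an anchored pattern.
my $second_line = ($text =~ /^two$/)      ? 1 : 0;
my $whole       = ($text =~ /^one\ntwo$/) ? 1 : 0;

# Anchors match a position, not a character: the match has zero width.
$text =~ /^/;
my $width = $+[0] - $-[0];

print "second line anchored: $second_line\n";   # 0
print "whole string anchored: $whole\n";        # 1
print "width of ^ match: $width\n";             # 0
```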
Assertions
(?=..) -> Look ahead assertion
This assertion only tests a location and does not consume any characters, i.e. it does not move the regex pointer
(?<=..) -> Look behind assertion
This assertion only tests a location and does not consume any characters, i.e. it does not move the regex pointer. Older Perl 5 releases do not support variable-length lookbehind; Perl 5.30 added limited, experimental support for it.
(?!..) -> Negative look ahead
(?<!..) -> Negative look behind
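The four assertions can be tried out as follows. This is an illustrative sketch; the strings and names are assumptions, not taken from the notes above.

```perl
use strict;
use warnings;

# Lookbehind + lookahead: find the positions that have a digit before them
# and a multiple of three digits after them, and insert a comma there.
# Nothing is consumed, so no digit is lost.
(my $with_commas = "1234567") =~ s/(?<=\d)(?=(?:\d{3})+$)/,/g;

# Negative lookahead: "foo" not followed by "bar".
my @not_bar = grep { /foo(?!bar)/ } qw(foobar foobaz food);

# Positive lookbehind: digits that directly follow a dollar sign.
my ($amount) = 'total: $250' =~ /(?<=\$)(\d+)/;

# Negative lookbehind: a number that is NOT preceded by a dollar sign.
my ($count) = 'cost $100 and 200 items' =~ /(?<!\$)\b(\d+)/;

print "$with_commas\n";   # 1,234,567
print "@not_bar\n";       # foobaz food
print "$amount\n";        # 250
print "$count\n";         # 200
```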
Multi line mode (/m) vs Single line mode (/s):
The mode modifiers change how the dot (.), caret (^) and dollar ($) work. (Slurp an entire file and experiment with these modifiers to get a better view.)
Code example:
my $str = "abc\ndef\nghi\n";      # double quotes, so that \n is a real newline
(my $m = $str) =~ s/^.*$/xyz/m;   # Output: $m -> "xyz\ndef\nghi\n"
(my $s = $str) =~ s/^.*$/xyz/s;   # Output: $s -> "xyz"
Explanation:
/m -> Multi-line mode makes ^ and $ take the embedded end-of-line (\n) characters into account. So the $str in the above example is seen as '^abc$\n^def$\n^ghi$\n'. It is effectively broken down into 3 sub-strings, and the regex is applied to the first one. Using /g it can be extended to match all 3 sub-strings.
/s -> Single-line mode keeps ^ and $ the same but affects how the match-all dot (.) works. By default the dot does not match a newline character, but in /s mode it does. So the $str in the above example is seen as '^abc\ndef\nghi\n$'. The dot matches \n, so .* runs to the end of the string.
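The effect of combining /m with /g, and of /s on the dot, can be checked with a short script. A sketch that reuses the sample string; the other variable names are only illustrative.

```perl
use strict;
use warnings;

my $str = "abc\ndef\nghi\n";

# /m alone: only the first line is replaced.
(my $first_only = $str) =~ s/^.*$/xyz/m;

# /m with /g: every line is replaced.
(my $every_line = $str) =~ s/^.*$/xyz/mg;

# /m with /g in list context: collect each line without its newline.
my @lines = $str =~ /^(.*)$/mg;

# /s: the dot also matches \n, so .* swallows the whole string.
my ($whole) = $str =~ /^(.*)$/s;

print scalar(@lines), " lines: @lines\n";   # 3 lines: abc def ghi
print "whole string captured\n" if $whole eq $str;
```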
PERL Regex Debug
perl -Mre=debug -e '<code that uses the regex>' -> To debug a regular expression, e.g. perl -Mre=debug -e '"aabbcc" =~ /(a+)ab/'
PERL Regex NFA explained
Perl uses an NFA (Nondeterministic Finite Automaton) based engine. In simple terms, an NFA engine's match is governed by the EXPRESSION (while a DFA engine uses the STRING to govern the search).
Example:
my $str = "aabbcc";
$str =~ m/(a+)/;
Here, STRING: aabbcc
EXPRESSION: (a+)
Note: Here $1 contains what is matched in-between ()
Match procedure:
The match always starts from the left. Since it is an NFA, the EXPRESSION governs the match, i.e. the engine picks up the EXPRESSION and matches it against the STRING (while a DFA uses the STRING to match against the EXPRESSION. Though "STRING used on EXPRESSION" and "EXPRESSION used on STRING" sound similar, the engines differ in how they hold the state space, which will be discussed in DFA vs NFA)
Instance 1:
STRING : aabbcc
EXPR : a+
STRING PTR: ^aabbcc
EXPR PTR : >a+
Instance 2:
STRING : aabbcc
EXPR : a+
STRING PTR: a^abbcc
EXPR PTR : >a+
Instance 3:
STRING : aabbcc
EXPR : a+
STRING PTR: aa^bbcc
EXPR PTR : >a+
Instance 4: (Match ends)
STRING : aabbcc
EXPR : a+
STRING PTR: aa^bbcc
EXPR PTR : a+>
At instance 4, the match ends with "$1" holding "aa"
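The end result of the walk-through can be confirmed by printing $1 together with the offsets Perl records in @- and @+ (start and end of each group). A minimal sketch:

```perl
use strict;
use warnings;

my $str = "aabbcc";
my ($captured, $start, $end);

if ($str =~ m/(a+)/) {
    # $-[1] and $+[1] are the start and end offsets of capture group 1.
    ($captured, $start, $end) = ($1, $-[1], $+[1]);
    print "\$1 = '$captured' (offsets $start to $end)\n";   # $1 = 'aa' (offsets 0 to 2)
}
```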
PERL Regex greediness, laziness and backtracking
By default all quantifiers (*, +, ?) are greedy, i.e. if + is used, though it means 1 or more, it will always try to match as much as possible.
E.g.:
my $str = "aaaabbcc";
$str =~ m/(a+)/;
In the above example, though the match already qualifies as a SUCCESS after matching the first "a", the quantifier keeps going for more. So at the end of the match "$1" contains "aaaa".
Although the quantifier is greedy, it always works towards an overall successful match, even if that means giving up a few of its own matches. This process is called backtracking and is an important aspect of an NFA.
E.g.:
my $str = "aabbcc";
$str =~ m/(a+)ab/;
Initial:
STRING : aabbcc
EXPR : a+
STRING PTR: ^aabbcc
EXPR PTR : >(a+)
STATE : No state is saved, as "+" needs at least 1 match
Instance 1:
STRING : aabbcc
EXPR : a+
STRING PTR: a^abbcc
EXPR PTR : >(a+)
STATE : Saved at positions a^abbcc
Instance 2:
STRING : aabbcc
EXPR : a+
STRING PTR: aa^bbcc
EXPR PTR : >(a+)
STATE : Saved at positions aa^bbcc
Instance 3: Failure
STRING : aabbcc
EXPR : a
STRING PTR: aa^bbcc (The pointer won't move as the match failed)
EXPR PTR : >a
STATE : Saved at positions aa^bbcc (State won't be moved further as it failed)
Since instance 3 is a failure, the engine moves back to last saved state i.e. a^abbcc.
This causes the a+ to relinquish one of the captured "a" with $1 becoming "a" from "aa"
This procedure of moving back to a successful saved state is called backtracking.
Instance 4:
STRING : aabbcc
EXPR : a
STRING PTR: a^abbcc
EXPR PTR : (a+)a
STATE : Saved at positions aa^bbcc
Instance 5:
STRING : aabbcc
EXPR : b
STRING PTR: aa^bbcc
EXPR PTR : (a+)ab
STATE : Saved at positions aab^bcc
Since this is the end of EXPR the match is a success with $1 holding "a".
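The outcome of the backtracking trace can be verified directly. The sketch below compares what (a+) captures on its own with what it is left holding once the rest of the pattern has to match too.

```perl
use strict;
use warnings;

my $str = "aabbcc";

# On its own, the greedy a+ keeps both "a"s.
my ($alone) = $str =~ m/(a+)/;

# Followed by "ab", a+ has to give one "a" back so that the rest can match.
my ($group, $whole);
if ($str =~ m/(a+)ab/) {
    $group = $1;
    $whole = substr($str, $-[0], $+[0] - $-[0]);   # text of the entire match
}

print "alone: $alone\n";                 # aa
print "after backtracking: $group\n";    # a
print "entire match: $whole\n";          # aab
```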
PERL Regex atomic grouping and possessive quantifier
Special chars and escape
Unlike traditional programs, Perl gives certain characters a special meaning by default.
If these characters need to be matched literally (i.e. used as search text and not as an operator), they must be preceded by a backslash.
Characters to be escaped outside character class [...]
. * ? + ^ $ [ { ( ) | \
1. Match-any ---> .
2. Quantifier ---> * ? +
3. Alternation ---> |
4. Anchors ---> ^ $
5. Grouping ---> ( )
6. Brackets and braces ---> [ {
7. Back slash ---> \
Characters to be escaped inside character class [...]
- ^ \ ]
1. Range operator ---> -
2. Negation (when first in the class) ---> ^
3. Back slash ---> \
4. Char class end ---> ]
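The escaping rules above can be tried out as below. The sketch also uses \Q...\E (the in-pattern form of Perl's quotemeta), which is not mentioned in the notes above but escapes a whole variable in one go. The sample strings are invented.

```perl
use strict;
use warnings;

my $text   = 'cost: 3.50 (approx)';
my $needle = '3.50 (approx)';

# An unescaped dot matches any character, so this matches too.
my $loose = ('3x50' =~ /3.50/) ? 1 : 0;

# Unescaped parentheses form a group, so the literal "(" is never matched.
my $unescaped = ($text =~ /3.50 (approx)/) ? 1 : 0;

# Escaping each special character by hand.
my $escaped = ($text =~ /3\.50 \(approx\)/) ? 1 : 0;

# \Q...\E escapes everything inside an interpolated variable.
my $quoted = ($text =~ /\Q$needle\E/) ? 1 : 0;

# Inside a character class only - ^ \ ] need a backslash.
my @found = 'a-b^c]d\e' =~ /([\-\^\\\]])/g;

print "$loose $unescaped $escaped $quoted\n";   # 1 0 1 1
print scalar(@found), " special characters found\n";   # 4 special characters found
```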
REGEX Capture groups
By design, Perl does not reset the capture variables ($1, $2, ...) set by the grouping construct (...) when a match fails.
So after a failed match they still hold the groups captured by the previous successful match.
E.g.:
my $a = "99";
my $b = "0";
$a =~ m/(\d)(\d)/;
print "First = $1, Second = $2 \n";
# case 1
$b =~ m/(\d)(\d)/;
print "Case 1: First = $1, Second = $2 \n ";
# case 2
$b =~ m/(\d)(\d)/;
if (defined $2) {
print "Case 2: First = $1, Second = $2 \n ";
}
# case 3
if ($b =~ m/(\d)(\d)/){
print "First = $1, Second = $2 \n ";
} else {
print "Case 3: Pattern not present \n";
}
RESULT:
First = 9, Second = 9
Case 1: First = 9, Second = 9
Case 2: First = 9, Second = 9
Case 3: Pattern not present
PERLDOC:
http://perldoc.perl.org/perlre.html#Capture-groups
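Besides testing the match result as in case 3, one way to avoid stale captures is to assign the match in list context: a failed match returns an empty list, so the variables end up undefined. A sketch with illustrative variable names:

```perl
use strict;
use warnings;

"99" =~ m/(\d)(\d)/;                        # succeeds: $1 = 9, $2 = 9

my $failed = ("0" =~ m/(\d)(\d)/) ? 0 : 1;  # fails: only one digit
my $stale  = $1;                            # still 9, left over from "99"

# List-context assignment: nothing is copied when the match fails.
my ($first, $second) = "0" =~ m/(\d)(\d)/;

print "match failed: $failed, stale \$1: $stale\n";   # match failed: 1, stale $1: 9
print "first is ", (defined $first ? $first : "undef"), "\n";   # first is undef
```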
Possessive quantifiers
Perl offers three different types of matching: greedy, lazy and possessive. A possessive quantifier is written by appending a "+" sign to a quantifier. For example:
"aaaxybbbcccx" =~ /^\w++xy/
A possessive quantifier is an optimized version of a greedy one that throws away all of its backtracking information. So once it has matched and the engine has moved forward, it never backtracks to give characters back. The only option left for the regex engine is to move its start pointer and start all over again.
The above example always fails, as "\w++" always swallows the entire string and, with no backtracking information, "xy" can never be matched. Just so there is no confusion, the case below fails too.
"aaaxybbbcccx" =~ /^\w++x/
Example usage of a possessive quantifier:
One of the most useful cases for a possessive quantifier is extracting balanced brackets.
$text =~ m/             # Match start
    (                   # Start capture group 1
        \{              # Match the opening "{"
        (?:             # Start non-capture group
            [^{}]++ | (?1)  # Match a run of non-brace characters, or
                        # recurse into capture group 1 for a nested pair
        )*              # Match 0 or more times
        \}              # Match the closing "}"
    )                   # Stop capture group 1
/x;                     # /x allows clean regex writing.
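The possessive examples in this section and a balanced-bracket matcher can be exercised as follows. This is a sketch: the bracket pattern assumes the character class is meant to exclude braces ([^{}]), the sample text is invented, and the (?1) recursion needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Greedy backs off and finds "xy"; possessive cannot.
my $str = "aaaxybbbcccx";
my $greedy_xy     = ($str =~ /^\w+xy/)  ? 1 : 0;
my $possessive_xy = ($str =~ /^\w++xy/) ? 1 : 0;
my $possessive_x  = ($str =~ /^\w++x/)  ? 1 : 0;

# Balanced braces: group 1 calls itself through (?1) for each nested pair.
my $balanced = qr/
    (                   # capture group 1
        \{              # opening brace
        (?:
            [^{}]++     # a run of non-brace characters, never given back
          | (?1)        # or a nested balanced group
        )*
        \}              # closing brace
    )
/x;

my $text   = 'if (x) { a { b } c } else { d }';
my @blocks = $text =~ /$balanced/g;

print "$greedy_xy $possessive_xy $possessive_x\n";   # 1 0 0
print "block: $_\n" for @blocks;
```

With the sample text, the first block is the outer "{ a { b } c }" including its nested pair, and the second is "{ d }".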