PERL Regex Scratch Pad

Class, alternation etc.

  • [--] -> Character class

  • (--) -> Grouping, capturing, constraining alternation.

  • | -> Alternation. Think of this as a set of rules to apply to a string. The rules are "OR"ed, i.e. only one of them needs to pass. PERL has ordered alternation: the alternatives are tried left to right, in the order they are written.

  • \1,\2 -> Backreferences to the text captured by (--), for use inside the search section of the pattern.

  • (?:..) -> Non-Capturing parentheses

  • \f -> ASCII form feed

  • \b -> Word boundary (inside a character class, [\b] means backspace)

  • /i -> Case-insensitive modifier

  • /g -> Global modifier

  • \w -> This metacharacter matches a-z A-Z 0-9 _ (for ASCII text)

  • /x -> Free-form expression modifier. Most whitespace in the pattern is ignored and anything from # to the end of the line is treated as a comment, so the search part of s/..// can be spread over multiple lines.
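A minimal sketch of the constructs listed above, run against made-up sample strings (the strings and variable names are only for illustration):

```perl
use strict;
use warnings;

my $text = "The cat sat; the Cat ran. hello hello world";

# Character class + /i + /g: collect every "cat" or "kat", ignoring case.
my @cats = $text =~ /[ck]at/gi;

# Ordered alternation: alternatives are tried left to right,
# so the earlier alternative "foo" wins over "foobar".
my ($first) = "foobar" =~ /(foo|foobar)/;

# Backreference \1 inside the pattern: find a doubled word.
my ($dup) = $text =~ /\b(\w+) \1\b/;

# (?:..) groups without capturing, so the only capture is the digits.
my ($num) = "id: 42" =~ /(?:id|key):\s*(\d+)/;

# /x: whitespace and comments inside the pattern are ignored.
my ($year) = "2024-05-17" =~ /
    (\d{4})   # four-digit year
    -\d{2}    # month
    -\d{2}    # day
/x;

print "cats=@cats first=$first dup=$dup num=$num year=$year\n";
# prints: cats=cat Cat first=foo dup=hello num=42 year=2024
```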

Greedy quantifiers

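  • * -> Matches the preceding item 0 or more times, as many as possible. Always succeeds, even when it matches nothing

  • ? -> Matches the preceding item 0 or 1 time, preferring 1

  • + -> Matches the preceding item 1 or more times, as many as possible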

Lazy quantifiers

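  • *? -> Matches the preceding item 0 or more times, as few as possible. Always succeeds, even when it matches nothing

  • ?? -> Matches the preceding item 0 or 1 time, preferring 0

  • +? -> Matches the preceding item 1 or more times, as few as possible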

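Possessive quantifiers

  • *+ -> Matches the preceding item 0 or more times, as many as possible, and never gives any back. Always succeeds, even when it matches nothing

  • ?+ -> Matches the preceding item 0 or 1 time and never gives it back

  • ++ -> Matches the preceding item 1 or more times, as many as possible, and never gives any back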
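The three quantifier families can be compared on one string. This is a small sketch with an invented sample string; possessive quantifiers assume Perl 5.10 or later:

```perl
use strict;
use warnings;

my $str = "<b>bold</b> and <i>italic</i>";

# Greedy: .* takes as much as it can, then backtracks to the LAST ">".
my ($greedy) = $str =~ /(<.*>)/;

# Lazy: .*? takes as little as it can, stopping at the FIRST ">".
my ($lazy) = $str =~ /(<.*?>)/;

# Possessive: .*+ takes everything and never gives it back,
# so the trailing ">" can never match and the whole match fails.
my $possessive = ($str =~ /<.*+>/) ? "match" : "no match";

# "*" succeeds even when it matches zero characters.
my $empty_ok = ("xyz" =~ /a*/) ? "match" : "no match";

print "greedy=$greedy\nlazy=$lazy\npossessive=$possessive\nempty=$empty_ok\n";
```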

Anchors

  • ^ $ -> Match the start and end positions of a string, not any character

By default ^ matches only at the start of the string and $ only at the end (or just before a newline that ends the string). Embedded end-of-line characters are ignored.
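A short sketch of the default anchor behaviour on an invented two-line string. The \z anchor is not covered in these notes; it is included here only as a contrast to $, since it matches strictly at the very end of the string:

```perl
use strict;
use warnings;

my $str = "first line\nsecond line\n";

my $at_start   = ($str =~ /^first/)  ? 1 : 0;  # 1: ^ matches at the string start
my $after_nl   = ($str =~ /^second/) ? 1 : 0;  # 0: embedded \n is ignored without /m
my $before_nl  = ($str =~ /line$/)   ? 1 : 0;  # 1: $ matches before the final \n
my $very_end   = ($str =~ /line\z/)  ? 1 : 0;  # 0: \z only matches at the very end

print "$at_start $after_nl $before_nl $very_end\n";   # prints: 1 0 1 0
```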

Assertions

  • (?=..) -> Look ahead assertion

This feature only tests the location for a match and doesn't consume any characters, i.e. it doesn't move the REGEX pointer.

  • (?<=..) -> Look behind assertion

This feature only tests the location for a match and doesn't consume any characters, i.e. it doesn't move the REGEX pointer. Perl versions before 5.30 do not support variable-length look behind assertions.

  • (?!..) -> Negative look ahead

  • (?<!..) -> Negative look behind
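A sketch of all four assertions on invented sample strings. Because assertions consume nothing, the first substitution inserts commas at positions without replacing any digit:

```perl
use strict;
use warnings;

# Look behind + look ahead: insert a comma wherever a digit precedes the
# position and a multiple of three digits (and then no more digits) follows.
(my $pretty = "1234567") =~ s/(?<=\d)(?=(?:\d{3})+(?!\d))/,/g;

# Look behind: amounts preceded by "USD" (the prefix is not part of the match).
my @usd = "USD100 EUR200 USD300" =~ /(?<=USD)\d+/g;

# Negative look ahead: a word starting with "foo" that does not go on with "bar".
my ($not_bar) = "foobar foobaz" =~ /(foo(?!bar)\w+)/;

# Negative look behind: single digits not preceded by "a".
my @not_after_a = "a1 b2 a3" =~ /(?<!a)\d/g;

print "$pretty | @usd | $not_bar | @not_after_a\n";
# prints: 1,234,567 | 100 300 | foobaz | 2
```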

Multi line mode (/m) vs Single line mode (/s):

The mode-altering modifiers change how the dot operator [.], caret [^] and dollar [$] work. (Slurp an entire file and try these modifiers on it to get a better view.)


Code example:

my $str = "abc\ndef\nghi\n";   # double quotes, so that \n is a real newline

(my $m = $str) =~ s/^.*$/xyz/m; -> Output: $m -> "xyz\ndef\nghi\n"

(my $s = $str) =~ s/^.*$/xyz/s; -> Output: $s -> "xyz"


Explanation:

/m -> Multi-line mode makes ^ and $ take the embedded end-of-line (\n) characters into account. So the $str in the above example will be seen as '^abc$\n^def$\n^ghi$\n'. It is in effect broken down into 3 sub-strings and the regex is applied to the first one. Using /g it can be extended to match all 3 sub-strings.



/s -> Single-line mode keeps ^ and $ the same but changes how the match-all [.] operator works. By default the [.] operator does not match a newline character, but in /s mode it does. So the $str in the above example will be seen as '^abc\ndef\nghi\n$'. The [.] operator matches \n as well and goes on till the end of the string.
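The effect of each modifier can be checked with a runnable sketch along these lines (variable names are illustrative; each substitution works on its own copy of the string):

```perl
use strict;
use warnings;

my $str = "abc\ndef\nghi\n";

# /m with /g: ^ and $ match at every line, so each line is replaced.
(my $every_line = $str) =~ s/^.*$/xyz/mg;      # "xyz\nxyz\nxyz\n"

# Without /s the dot stops at the first newline.
my ($dot_default) = $str =~ /(.*)/;            # "abc"

# With /s the dot also matches newlines and runs to the end of the string.
my ($dot_single) = $str =~ /(.*)/s;            # the whole string

# /m with /g in list context: first character of every line.
my @first_chars = $str =~ /^(\w)/mg;           # ("a", "d", "g")

print "first chars: @first_chars\n";
```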

PERL Regex Debug

  • perl -Mre=debug -e '"<string>" =~ /<Regex pattern>/' -> To debug a regular expression

PERL Regex NFA explained

PERL uses an NFA (Nondeterministic Finite Automaton) based engine. In simple terms, an NFA engine's match is governed by the EXPRESSION (while a DFA engine uses the STRING to govern the search).


Example:

my $str = "aabbcc";

$str =~ m/(a+)/;


Here, STRING: aabbcc

EXPRESSION: (a+)


Note: Here $1 contains what is matched in-between ()


Match procedure:

The match always starts from the left. Since it's an NFA, the EXPRESSION governs the match, i.e. the engine picks up the EXPRESSION and matches it against the STRING (while a DFA uses the STRING to match against the EXPRESSION. Though "STRING used on EXPRESSION" and "EXPRESSION used on STRING" sound similar, there are differences in how the engine holds the state space, which will be discussed in DFA vs NFA).


Instance 1:

STRING : aabbcc

EXPR : a+

STRING PTR: ^aabbcc

EXPR PTR : >a+


Instance 2:

STRING : aabbcc

EXPR : a+

STRING PTR: a^abbcc

EXPR PTR : >a+


Instance 3:

STRING : aabbcc

EXPR : a+

STRING PTR: aa^bbcc

EXPR PTR : >a+


Instance 4: (Match ends)

STRING : aabbcc

EXPR : a+

STRING PTR: aa^bbcc

EXPR PTR : a+>


At instance 4, the match ends with "$1" holding "aa"
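The end state of this walk can be confirmed with a small sketch. It reads the built-in @- and @+ arrays, which hold the start and end offsets of each group after a successful match:

```perl
use strict;
use warnings;

my $str = "aabbcc";
my ($captured, $start, $end);

if ($str =~ m/(a+)/) {
    # $-[1] and $+[1] are the start and end offsets of capture group 1.
    ($captured, $start, $end) = ($1, $-[1], $+[1]);
}

print "captured=$captured start=$start end=$end\n";
# prints: captured=aa start=0 end=2
```

The end offset of 2 corresponds to the STRING PTR position aa^bbcc in instance 4.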

PERL Regex greediness, laziness and backtracking

By default all quantifiers (*, +, ?) are greedy in nature, i.e. if + is used, though it means 1 or more, it will always try to match as much as possible.


E.g.:

my $str = "aaaabbcc";

$str =~ m/(a+)/;


In the above example, though the match already qualifies as a SUCCESS after matching the first "a", the quantifier keeps going for more. So at the end of the match "$1" contains "aaaa".


While the quantifier is greedy in nature, it will always work towards a successful overall match, even if that means giving up some of its own matches. This process is called backtracking and is an important aspect of an NFA.


E.g.:

my $str = "aabbcc";

$str =~ m/(a+)ab/;


Initial:

STRING : aabbcc

EXPR : a+

STRING PTR: ^aabbcc

EXPR PTR : >(a+)

STATE : No state space is saved as "+" needs at least 1 match


Instance 1:

STRING : aabbcc

EXPR : a+

STRING PTR: a^abbcc

EXPR PTR : >(a+)

STATE : Saved at positions a^abbcc


Instance 2:

STRING : aabbcc

EXPR : a+

STRING PTR: aa^bbcc

EXPR PTR : >(a+)

STATE : Saved at positions aa^bbcc


Instance 3: Failure

STRING : aabbcc

EXPR : a

STRING PTR: aa^bbcc (The pointer won't move as the match failed)

EXPR PTR : >a

STATE : Saved at positions aa^bbcc (State won't be moved further as it failed)


Since instance 3 is a failure, the engine moves back to the last saved state, i.e. a^abbcc.

This causes the a+ to relinquish one of the captured "a"s, with $1 becoming "a" instead of "aa".

This procedure of moving back to a successfully saved state is called backtracking.


Instance 5:

STRING : aabbcc

EXPR : a

STRING PTR: a^abbcc

EXPR PTR : (a+)a

STATE : Saved at positions aa^bbcc


Instance 6:

STRING : aabbcc

EXPR : b

STRING PTR: aa^bbcc

EXPR PTR : (a+)ab

STATE : Saved at positions aab^bcc


Since this is the end of EXPR the match is a success with $1 holding "a".
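The outcome of the walk can be reproduced with a short sketch. It compares the same capture group with and without the trailing "ab", and uses the built-in @- and @+ offsets to show the full matched text:

```perl
use strict;
use warnings;

my $str = "aabbcc";

# Alone, the greedy a+ keeps both "a" characters.
my ($alone) = $str =~ m/(a+)/;

# Followed by "ab", a+ must give one "a" back so the rest can match.
my ($with_tail) = $str =~ m/(a+)ab/;

# Offsets of the whole match ($-[0], $+[0]) from the last successful match.
my $whole = substr($str, $-[0], $+[0] - $-[0]);

print "alone=$alone with_tail=$with_tail whole=$whole\n";
# prints: alone=aa with_tail=a whole=aab
```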

PERL Regex atomic grouping and possessive quantifier

Special chars and escape

Within a regex, PERL treats certain characters as operators (metacharacters) by default.

If these characters need to be matched literally (i.e. used as a search string and not as an operator), they must be preceded by a backslash.

Characters to be escaped outside character class [...]

. * ? + ^ $ [ { ( ) | \

1. Quantifier ---> * ? +

2. Alternation ---> |

3. Anchors ---> ^ $

4. Grouping ---> ( )

5. Brackets and braces ---> [ {

6. Back slash ---> \

Characters to be escaped inside character class [...]

- ^ \ ]

1. Range operator ---> -

2. Negation (when first in the class) ---> ^

3. Back slash ---> \

4. Char class end ---> ]
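A sketch of the escaping rules on invented strings. Besides hand-written backslashes it uses \Q...\E, which is not covered in these notes: it escapes every metacharacter in an interpolated string (the quotemeta function does the same):

```perl
use strict;
use warnings;

my $text = "1+1=2 (approx.)";

# Unescaped "+" is a quantifier: this looks for "11=2" and fails.
my $unescaped = ($text =~ /1+1=2/)  ? "match" : "no match";

# Escaped "+" is a literal plus sign.
my $escaped   = ($text =~ /1\+1=2/) ? "match" : "no match";

# \Q...\E escapes the parentheses and the dot in $needle for us.
my $needle = "(approx.)";
my $quoted = ($text =~ /\Q$needle\E/) ? "match" : "no match";

# Inside a character class, - ^ \ ] are the characters that need escaping.
my @specials = "a-z^]" =~ /[\-\^\]]/g;

print "$unescaped, $escaped, $quoted, @specials\n";
# prints: no match, match, match, - ^ ]
```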

REGEX Capture groups

PERL by design does not reset the capture variables ($1, $2, ...) set by the grouping construct (...) when a match fails.

So after a failed match they still hold the values captured by the previous successful match.


E.g.:

my $a = "99";

my $b = "0";

$a =~ m/(\d)(\d)/;

print "First = $1, Second = $2 \n";


# case 1

$b =~ m/(\d)(\d)/;

print "Case 1: First = $1, Second = $2 \n ";


# case 2

$b =~ m/(\d)(\d)/;

if (defined $2) {

print "Case 2: First = $1, Second = $2 \n ";

}


# case 3

if ($b =~ m/(\d)(\d)/){

print "First = $1, Second = $2 \n ";

} else {

print "Case 3: Pattern not present \n";

}


RESULT:

First = 9, Second = 9

Case 1: First = 9, Second = 9

Case 2: First = 9, Second = 9

Case 3: Pattern not present


PERLDOC:

http://perldoc.perl.org/perlre.html#Capture-groups
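One way to avoid stale captures, besides testing the match in an if as in case 3, is to assign the captures in list context. This is a sketch with illustrative variable names; a failed match returns an empty list, so the variables end up undefined:

```perl
use strict;
use warnings;

my $good = "99";
my $bad  = "0";

$good =~ m/(\d)(\d)/;                 # succeeds: $1 = 9, $2 = 9

# List assignment: on failure both variables are undef, not stale values.
my ($first, $second) = $bad =~ m/(\d)(\d)/;

my $stale  = $1;                      # still "9" from the earlier match
my $status = defined $first ? "matched" : "no match";

print "status=$status stale=$stale\n";
# prints: status=no match stale=9
```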

Possessive quantifiers

PERL offers three different types of matching: greedy, lazy and possessive. Possessive quantifiers are enabled by appending a "+" sign to a quantifier. For example:

"aaaxybbbcccx" =~ /^\w++xy/


A possessive quantifier is an optimized version of a greedy one that throws away all its backtracking information. So once it has matched and the engine moves forward, there is no backtracking into it to try shorter matches. The only option is for the REGEX engine to move its start pointer and start all over again.


The above example always fails, as "\w++" matches the entire string and, with no backtracking information, "xy" can never be matched. Just so there is no confusion, the case below fails too.

"aaaxybbbcccx" =~ /^\w++x/
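A sketch that puts the greedy, possessive and atomic-group forms side by side on the same string. The atomic group (?>...) is the grouping form of the same idea: whatever it matched is never given back. The last pattern is an extra illustration of a possessive quantifier that does succeed:

```perl
use strict;
use warnings;

my $str = "aaaxybbbcccx";

# Greedy: \w+ takes everything, then backtracks until "xy" fits.
my $greedy     = ($str =~ /^\w+xy/)     ? "match" : "no match";

# Possessive: \w++ takes everything and cannot backtrack.
my $possessive = ($str =~ /^\w++xy/)    ? "match" : "no match";

# Atomic group: behaves like the possessive form above.
my $atomic     = ($str =~ /^(?>\w+)xy/) ? "match" : "no match";

# a++ only takes the three "a" characters, so "xy" can still follow.
my $narrow     = ($str =~ /^a++xy/)     ? "match" : "no match";

print "$greedy, $possessive, $atomic, $narrow\n";
# prints: match, no match, no match, match
```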


Example of usage of possessive quantifier:

One of the most useful cases for possessive quantifiers is extracting balanced brackets.

$text =~ m/ # Match start

( # Start capture group 1

\{ # Match the opening "{"

(?: # Start non-capture group

[^{}]++ | (?1) # Match a run of characters other than "{" and "}",

# or recurse into capture group 1 for a nested pair

)* # Match 0 or more times

\} # Match the closing "}"

) # Stop capture group 1

/x; # /x allows clean regex writing.
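A runnable sketch of a recursive balanced-brace pattern of this kind, assuming Perl 5.10 or later for (?1) and ++. It uses [^{}]++ for runs of non-brace characters; the sample strings and variable names are invented:

```perl
use strict;
use warnings;

my $balanced = qr/
    (                 # capture group 1
        \{            # opening brace
        (?:
            [^{}]++   # run of non-brace characters, never given back
          | (?1)      # or a nested balanced pair, by recursing into group 1
        )*
        \}            # matching closing brace
    )
/x;

# Nested braces: the match stops at the brace that closes the first one.
my $text = "sub f { if (1) { return {a => 1}; } } trailing }";
my ($block) = $text =~ $balanced;

# Unbalanced outer brace: only the inner, balanced pair can match.
my ($inner) = "{ a { b }" =~ $balanced;

print "block=$block\ninner=$inner\n";
# prints: block={ if (1) { return {a => 1}; } }
#         inner={ b }
```

The possessive [^{}]++ keeps the engine from retrying every shorter split of a non-brace run when a closing brace is missing.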