Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does Perl backtracking match failure seem to take less time than match success?

Tags:

regex

perl

I have a huge file aab.txt whose contents are aaa...aab.

To my great surprise

perl -ne '/a*bb/' < aab.txt

runs (match failure) faster than

perl -ne '/a*b/' < aab.txt

(match success). Why???? Both should first gobble up all the a's, then the second one immediately succeeds, while the first will then have to backtrack over and over again, to fail.

like image 521
Mark Galeck Avatar asked Oct 24 '13 23:10

Mark Galeck


1 Answers

Perl regexes are optimized to rather fail as early as possible, than to succeed as fast as possible. This makes a lot of sense when grepping through a large log file.

There is an optimization that first looks for a constant part of the string, in this case, a “floating” b or bb. This can be checked rather efficiently without having to keep track of backtracking state. No bb is found, and the match aborts right there.

Not so with b. That floating substring is found, and the match constructed from there. Here is the debug output of the regex match (program is "aaab" =~ /a*b/):

Compiling REx "a*b"
synthetic stclass "ANYOF_SYNTHETIC[ab][]".
Final program:
   1: STAR (4)
   2:   EXACT <a> (0)
   4: EXACT <b> (6)
   6: END (0)
floating "b" at 0..2147483647 (checking floating) stclass ANYOF_SYNTHETIC[ab][] minlen 1 
Guessing start of match in sv for REx "a*b" against "aaab"
Found floating substr "b" at offset 3...
start_shift: 0 check_at: 3 s: 0 endpos: 4 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "a*b" against "aaab"
Matching stclass ANYOF_SYNTHETIC[ab][] against "aaab" (4 bytes)
   0 <> <aaab>               |  1:STAR(4)
                                  EXACT <a> can match 3 times out of 2147483647...
   3 <aaa> <b>               |  4:  EXACT <b>(6)
   4 <aaab> <>               |  6:  END(0)
Match successful!
Freeing REx: "a*b"

You can get such output with the debug option for the re pragma.

Finding the b or bb is unnecessary, strictly speaking, but it allows the match to fail much earlier.

like image 172
amon Avatar answered Sep 25 '22 20:09

amon