Why does Perl backtracking match failure seem to take less time than match success?

Question

I have a huge file aab.txt whose contents are aaa...aab.

To my great surprise

perl -ne '/a*bb/' < aab.txt

runs (match failure) faster than

perl -ne '/a*b/' < aab.txt

(match success). Why???? Both should first gobble up all the a's, then the second one immediately succeeds, while the first will then have to backtrack over and over again, to fail.

amon · Accepted Answer

Perl regexes are optimized to rather fail as early as possible, than to succeed as fast as possible. This makes a lot of sense when grepping through a large log file.

There is an optimization that first looks for a constant part of the string, in this case, a “floating” b or bb. This can be checked rather efficiently without having to keep track of backtracking state. No bb is found, and the match aborts right there.

Not so with b. That floating substring is found, and the match constructed from there. Here is the debug output of the regex match (program is "aaab" =~ /a*b/):

Compiling REx "a*b"
synthetic stclass "ANYOF_SYNTHETIC[ab][]".
Final program:
   1: STAR (4)
   2:   EXACT <a> (0)
   4: EXACT <b> (6)
   6: END (0)
floating "b" at 0..2147483647 (checking floating) stclass ANYOF_SYNTHETIC[ab][] minlen 1 
Guessing start of match in sv for REx "a*b" against "aaab"
Found floating substr "b" at offset 3...
start_shift: 0 check_at: 3 s: 0 endpos: 4 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "a*b" against "aaab"
Matching stclass ANYOF_SYNTHETIC[ab][] against "aaab" (4 bytes)
   0 <> <aaab>               |  1:STAR(4)
                                  EXACT <a> can match 3 times out of 2147483647...
   3 <aaa> <b>               |  4:  EXACT <b>(6)
   4 <aaab> <>               |  6:  END(0)
Match successful!
Freeing REx: "a*b"

You can get such output with the debug option for the re pragma.

Finding the b or bb is unnecessary, strictly speaking, but it allows the match to fail much earlier.

Why does Perl backtracking match failure seem to take less time than match success?

Tags:

regex

perl

Mark Galeck

1 Answers

amon

Recent Activity

Donate For Us

Why does Perl backtracking match failure seem to take less time than match success?

Tags:

regex

perl

Mark Galeck

1 Answers

amon

Related questions

Recent Activity

Donate For Us