I have a huge file aab.txt
whose contents are aaa
...aab
.
To my great surprise
perl -ne '/a*bb/' < aab.txt
runs (match failure) faster than
perl -ne '/a*b/' < aab.txt
(match success). Why???? Both should first gobble up all the a's, then the second one immediately succeeds, while the first will then have to backtrack over and over again, to fail.
Perl regexes are optimized to rather fail as early as possible, than to succeed as fast as possible. This makes a lot of sense when grepping through a large log file.
There is an optimization that first looks for a constant part of the string, in this case, a “floating” b
or bb
. This can be checked rather efficiently without having to keep track of backtracking state. No bb
is found, and the match aborts right there.
Not so with b
. That floating substring is found, and the match constructed from there. Here is the debug output of the regex match (program is "aaab" =~ /a*b/
):
Compiling REx "a*b"
synthetic stclass "ANYOF_SYNTHETIC[ab][]".
Final program:
1: STAR (4)
2: EXACT <a> (0)
4: EXACT <b> (6)
6: END (0)
floating "b" at 0..2147483647 (checking floating) stclass ANYOF_SYNTHETIC[ab][] minlen 1
Guessing start of match in sv for REx "a*b" against "aaab"
Found floating substr "b" at offset 3...
start_shift: 0 check_at: 3 s: 0 endpos: 4 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "a*b" against "aaab"
Matching stclass ANYOF_SYNTHETIC[ab][] against "aaab" (4 bytes)
0 <> <aaab> | 1:STAR(4)
EXACT <a> can match 3 times out of 2147483647...
3 <aaa> <b> | 4: EXACT <b>(6)
4 <aaab> <> | 6: END(0)
Match successful!
Freeing REx: "a*b"
You can get such output with the debug
option for the re
pragma.
Finding the b
or bb
is unnecessary, strictly speaking, but it allows the match to fail much earlier.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With