I've rewritten this code as something I think is more comprehensible. The OP was comparing b
with d
and suchlike, and I've changed all the symbols to more distinct ASCII characters. The result is equivalent to that of the OP's original code
I've briefly checked manually all of the regex patterns, but I don't see a discrepancy
#! /usr/local/bin/perl
use strict;
use warnings qw/ all FATAL /;
use List::Util 'max';
my @tests = (
[ vvOHvXcvv => qr/ ^ ( (v*) O | H? (v*) X )* c \2 $ /x ],
[ vvOvXcvv => qr/ ^ ( (v*) O | H? (v*) X )* c \2 $ /x ],
[ vvXHvXcvv => qr/ ^ ( (v*) X | H? (v*) X )* c \2 $ /x ],
[ vvXvXcvv => qr/ ^ ( (v*) X | H? (v*) X )* c \2 $ /x ],
[ vvOHvXcvv => qr/ ^ ( (v*) [XO] | H? (v*) X )* c \2 $ /x ],
[ vvOvXcvv => qr/ ^ ( (v*) [XO] | H? (v*) X )* c \2 $ /x ],
[ vvXHvXcvv => qr/ ^ ( (v*) [XO] | H? (v*) X )* c \2 $ /x ],
[ vvXvXcvv => qr/ ^ ( (v*) [XO] | H? (v*) X )* c \2 $ /x ],
);
my $w1 = max map length $_->[0], @tests;
my ($no, $yes) = ( 'MATCHES', "doesn't match" );
my $w2 = max map length, $no, $yes;
for my $test ( @tests ) {
my ( $str, $re ) = @$test;
printf "%-*s %-*s %s\n",
$w1+2, qq{"$str"},
$w2, $str =~ $re ? 'MATCHES' : "doesn't match",
$re;
}
"vvOHvXcvv" MATCHES (?^x: ^ ( (v*) O | H? (v*) X )* c \2 $ )
"vvOvXcvv" MATCHES (?^x: ^ ( (v*) O | H? (v*) X )* c \2 $ )
"vvXHvXcvv" MATCHES (?^x: ^ ( (v*) X | H? (v*) X )* c \2 $ )
"vvXvXcvv" doesn't match (?^x: ^ ( (v*) X | H? (v*) X )* c \2 $ )
"vvOHvXcvv" doesn't match (?^x: ^ ( (v*) [XO] | H? (v*) X )* c \2 $ )
"vvOvXcvv" doesn't match (?^x: ^ ( (v*) [XO] | H? (v*) X )* c \2 $ )
"vvXHvXcvv" doesn't match (?^x: ^ ( (v*) [XO] | H? (v*) X )* c \2 $ )
"vvXvXcvv" doesn't match (?^x: ^ ( (v*) [XO] | H? (v*) X )* c \2 $ )
The following Perl program tests a few strings against various regex patterns that use back-references. It illustrates a behaviour that I cannot understand.
The $snum
and $rnum
variables are used only to number the strings and patterns in the output for easier reading. The only thing worth reading is the contents of the @test
array.
#! /usr/local/bin/perl -w
use strict;
use warnings;
my @test = (
[ "aadeabcaa", qr/^((a*)d|e?(a*)b)*c\2$/ ],
[ "aadabcaa", qr/^((a*)d|e?(a*)b)*c\2$/ ],
[ "aabeabcaa", qr/^((a*)b|e?(a*)b)*c\2$/ ],
[ "aababcaa", qr/^((a*)b|e?(a*)b)*c\2$/ ],
[ "aadeabcaa", qr/^((a*)[bd]|e?(a*)b)*c\2$/ ],
[ "aadabcaa", qr/^((a*)[bd]|e?(a*)b)*c\2$/ ],
[ "aabeabcaa", qr/^((a*)[bd]|e?(a*)b)*c\2$/ ],
[ "aababcaa", qr/^((a*)[bd]|e?(a*)b)*c\2$/ ],
);
my %snum;
my %rnum;
my $lsnum;
my $lrnum;
for ( my $i = 0 ; $i < scalar(@test); $i++ ) {
my $t = $test[$i]; my $s = $t->[0]; my $r = $t->[1];
my $snum = ($snum{$s} //= $lsnum++);
my $rnum = ($rnum{$r} //= $lrnum++);
my $match = ($s =~ $r);
print "test $i: (S$snum) $s" .
($match?" MATCHES ":" DOES NOT match ") .
"(R$rnum) $r\n";
}
test 0: (S0) aadeabcaa MATCHES (R0) (?^:^((a*)d|e?(a*)b)*c\2$)
test 1: (S1) aadabcaa MATCHES (R0) (?^:^((a*)d|e?(a*)b)*c\2$)
test 2: (S2) aabeabcaa MATCHES (R1) (?^:^((a*)b|e?(a*)b)*c\2$)
test 3: (S3) aababcaa DOES NOT match (R1) (?^:^((a*)b|e?(a*)b)*c\2$)
test 4: (S0) aadeabcaa DOES NOT match (R2) (?^:^((a*)[bd]|e?(a*)b)*c\2$)
test 5: (S1) aadabcaa DOES NOT match (R2) (?^:^((a*)[bd]|e?(a*)b)*c\2$)
test 6: (S2) aabeabcaa DOES NOT match (R2) (?^:^((a*)[bd]|e?(a*)b)*c\2$)
test 7: (S3) aababcaa DOES NOT match (R2) (?^:^((a*)[bd]|e?(a*)b)*c\2$)
Note that egrep
(or at any rate, GNU egrep
) thinks that every test above is a match.
I think that is the theoretically "correct" answer if regexp disjunction is interpreted as a non-deterministic choice, in the sense that there exists a choice of alternatives that will make the match succeed.
Also note that (S2
, S3
, R1
) are obtained by replacing b
for d
everywhere in (S0
, S1
, R0
), which is another reason to think that the fourth test should be a match.
Intuitively, I would also like tests 4–7 to be matches insofar as tests 0–3 are.
I can sort of understand how one would arrive at the fourth test not matching: by trying the left branch and the right right branch in this order at each disjunction, if backtracking does not correctly restore the \2
variable to its prior value, exploring the left branch of the R1 disjunction on the latter ab
substring of S3 would clobber \2
to a
which would then not be backtracked to its aa
value, causing the match to fail (whereas the same thing would not happen in any of the previous tests).
But I have no idea whether my analysis is correct. Why the fifth test doesn't match really escapes me.
So anyway, my question is a combination of the following:
Can someone explain Perl's regexp engine behavior on those examples in detail?
Is this behavior intentional? Is it documented somewhere?
Should I file a bug?
There is an even easier example of the difference between egrep and Perl:
grep -iE '^(([ab])|([ab]))*\2$' <<< abA
abA
perl -wE 'say for shift =~ /^(([ab])|([ab]))*\2$/i' abA
Interestingly, the following matches in Perl (and egrep, too):
grep -iE '^(([ab])|([ab]))*(\3)$' <<< abA
abA
perl -wE 'say for shift =~ /^(([ab])|([ab]))*(\3)$/i' abA
b
b
a
A
So, the first a
is matched by the first iteration of *
, b
is matched by the second one (because \1 eq 'b'
). At the same time, \3 eq 'a'
, but \4 eq 'A'
. Why is \3 eq 'a'
? It seems to be a result of the previous iteration of the *
, which I'd say is a bug.
Update: I've reported a bug.
Let's take a go at the fourth example. (Please don't number them from zero! I people, not computer!)
vvXvXcvv
doesn't match
qr/ ^ (
(v*) X
|
H? (v*) X
)* c \2 $ /x
At the beginning of the string, perl matches the first of the two alternatives. vvX
matches (v*) X
so there is no need to try the alternative. That also saves capture 2 asvv
That leaves vXcvv
for the engine to match
Again, perl uses vX
to match (v*) X
. It saves capture 2 as v
and the engine goes around for another try
That leaves cvv
The only options left are another iteration of ( (v*) X | H? (v*) X )*
, or falling out of that loop into c \2
The text doesn't start with v
, X
, or H
so the loop ends, and the next match is c \2
, and the regex engine matches the c
Now there is only vv
to match
perl is now looking for a match to capture 2, which is v
. That succeeds
The remaining string is just v
Now perl is looking for $
, which is the end of a string, or just before a newline at the end of a string. It sees v
and so it fails
I really hope that helps. I'm not in a hurry to explain the remaining four examples and I can't yet see why there is confusion
I haven't experimented with egrep
, and I'm surprised that it behaves differently. Maybe it doesn't stack the captures like Perl does?
Please let me know if there's anything of further interest
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With