Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

I would like explanation of the behaviour of Perl's regular expression engine

Tags:

regex

perl

Update by @Borodin

I've rewritten this code as something I think is more comprehensible. The OP was comparing b with d and suchlike, and I've changed all the symbols to more distinct ASCII characters. The result is equivalent to that of the OP's original code

I've briefly checked manually all of the regex patterns, but I don't see a discrepancy

#! /usr/local/bin/perl

use strict;
use warnings qw/ all FATAL /;

use List::Util 'max';

my @tests = (
    [ vvOHvXcvv => qr/ ^ ( (v*) O    | H? (v*) X )* c \2 $ /x ],
    [ vvOvXcvv  => qr/ ^ ( (v*) O    | H? (v*) X )* c \2 $ /x ],
    [ vvXHvXcvv => qr/ ^ ( (v*) X    | H? (v*) X )* c \2 $ /x ],
    [ vvXvXcvv  => qr/ ^ ( (v*) X    | H? (v*) X )* c \2 $ /x ],
    [ vvOHvXcvv => qr/ ^ ( (v*) [XO] | H? (v*) X )* c \2 $ /x ],
    [ vvOvXcvv  => qr/ ^ ( (v*) [XO] | H? (v*) X )* c \2 $ /x ],
    [ vvXHvXcvv => qr/ ^ ( (v*) [XO] | H? (v*) X )* c \2 $ /x ],
    [ vvXvXcvv  => qr/ ^ ( (v*) [XO] | H? (v*) X )* c \2 $ /x ],
);

my $w1 = max map length $_->[0], @tests;
my ($no, $yes) = ( 'MATCHES', "doesn't match" );
my $w2 = max map length, $no, $yes;

for my $test ( @tests ) {
    my ( $str, $re ) = @$test;

    printf "%-*s %-*s %s\n",
            $w1+2, qq{"$str"},
            $w2, $str =~ $re ? 'MATCHES' : "doesn't match",
            $re;
}

output

"vvOHvXcvv" MATCHES       (?^x: ^ ( (v*) O    | H? (v*) X )* c \2 $ )
"vvOvXcvv"  MATCHES       (?^x: ^ ( (v*) O    | H? (v*) X )* c \2 $ )
"vvXHvXcvv" MATCHES       (?^x: ^ ( (v*) X    | H? (v*) X )* c \2 $ )
"vvXvXcvv"  doesn't match (?^x: ^ ( (v*) X    | H? (v*) X )* c \2 $ )
"vvOHvXcvv" doesn't match (?^x: ^ ( (v*) [XO] | H? (v*) X )* c \2 $ )
"vvOvXcvv"  doesn't match (?^x: ^ ( (v*) [XO] | H? (v*) X )* c \2 $ )
"vvXHvXcvv" doesn't match (?^x: ^ ( (v*) [XO] | H? (v*) X )* c \2 $ )
"vvXvXcvv"  doesn't match (?^x: ^ ( (v*) [XO] | H? (v*) X )* c \2 $ )



The following Perl program tests a few strings against various regex patterns that use back-references. It illustrates a behaviour that I cannot understand.

The $snum and $rnum variables are used only to number the strings and patterns in the output for easier reading. The only thing worth reading is the contents of the @test array.

#! /usr/local/bin/perl -w

use strict;
use warnings;

my @test = (
    [ "aadeabcaa", qr/^((a*)d|e?(a*)b)*c\2$/ ],
    [ "aadabcaa", qr/^((a*)d|e?(a*)b)*c\2$/ ],
    [ "aabeabcaa", qr/^((a*)b|e?(a*)b)*c\2$/ ],
    [ "aababcaa", qr/^((a*)b|e?(a*)b)*c\2$/ ],
    [ "aadeabcaa", qr/^((a*)[bd]|e?(a*)b)*c\2$/ ],
    [ "aadabcaa", qr/^((a*)[bd]|e?(a*)b)*c\2$/ ],
    [ "aabeabcaa", qr/^((a*)[bd]|e?(a*)b)*c\2$/ ],
    [ "aababcaa", qr/^((a*)[bd]|e?(a*)b)*c\2$/ ],
);

my %snum;
my %rnum;
my $lsnum;
my $lrnum;

for ( my $i = 0 ; $i < scalar(@test); $i++ ) {

    my $t = $test[$i];  my $s = $t->[0];  my $r = $t->[1];

    my $snum = ($snum{$s} //= $lsnum++);
    my $rnum = ($rnum{$r} //= $lrnum++);

    my $match = ($s =~ $r);

    print "test $i: (S$snum) $s" .
        ($match?" MATCHES ":" DOES NOT match ") .
        "(R$rnum) $r\n";
}

output

test 0: (S0) aadeabcaa MATCHES (R0) (?^:^((a*)d|e?(a*)b)*c\2$)
test 1: (S1) aadabcaa MATCHES (R0) (?^:^((a*)d|e?(a*)b)*c\2$)
test 2: (S2) aabeabcaa MATCHES (R1) (?^:^((a*)b|e?(a*)b)*c\2$)
test 3: (S3) aababcaa DOES NOT match (R1) (?^:^((a*)b|e?(a*)b)*c\2$)
test 4: (S0) aadeabcaa DOES NOT match (R2) (?^:^((a*)[bd]|e?(a*)b)*c\2$)
test 5: (S1) aadabcaa DOES NOT match (R2) (?^:^((a*)[bd]|e?(a*)b)*c\2$)
test 6: (S2) aabeabcaa DOES NOT match (R2) (?^:^((a*)[bd]|e?(a*)b)*c\2$)
test 7: (S3) aababcaa DOES NOT match (R2) (?^:^((a*)[bd]|e?(a*)b)*c\2$)

Note that egrep (or at any rate, GNU egrep) thinks that every test above is a match.

I think that is the theoretically "correct" answer if regexp disjunction is interpreted as a non-deterministic choice, in the sense that there exists a choice of alternatives that will make the match succeed.

Also note that (S2, S3, R1) are obtained by replacing b for d everywhere in (S0, S1, R0), which is another reason to think that the fourth test should be a match.

Intuitively, I would also like tests 4–7 to be matches insofar as tests 0–3 are.

I can sort of understand how one would arrive at the fourth test not matching: by trying the left branch and the right right branch in this order at each disjunction, if backtracking does not correctly restore the \2 variable to its prior value, exploring the left branch of the R1 disjunction on the latter ab substring of S3 would clobber \2 to a which would then not be backtracked to its aa value, causing the match to fail (whereas the same thing would not happen in any of the previous tests).

But I have no idea whether my analysis is correct. Why the fifth test doesn't match really escapes me.

So anyway, my question is a combination of the following:

  • Can someone explain Perl's regexp engine behavior on those examples in detail?

  • Is this behavior intentional? Is it documented somewhere?

  • Should I file a bug?

like image 385
Gro-Tsen Avatar asked May 22 '16 15:05

Gro-Tsen


2 Answers

There is an even easier example of the difference between egrep and Perl:

grep -iE '^(([ab])|([ab]))*\2$' <<< abA
abA
perl -wE 'say for shift =~ /^(([ab])|([ab]))*\2$/i' abA

Interestingly, the following matches in Perl (and egrep, too):

grep -iE '^(([ab])|([ab]))*(\3)$' <<< abA
abA
perl -wE 'say for shift =~ /^(([ab])|([ab]))*(\3)$/i' abA
b
b
a
A

So, the first a is matched by the first iteration of *, b is matched by the second one (because \1 eq 'b'). At the same time, \3 eq 'a', but \4 eq 'A'. Why is \3 eq 'a'? It seems to be a result of the previous iteration of the *, which I'd say is a bug.

Update: I've reported a bug.

like image 82
choroba Avatar answered Oct 16 '22 04:10

choroba


Let's take a go at the fourth example. (Please don't number them from zero! I people, not computer!)

vvXvXcvv

doesn't match

qr/ ^ (
    (v*) X
    |
    H? (v*) X
)* c \2 $ /x
  • At the beginning of the string, perl matches the first of the two alternatives. vvX matches (v*) X so there is no need to try the alternative. That also saves capture 2 asvv

    That leaves vXcvv for the engine to match

  • Again, perl uses vX to match (v*) X. It saves capture 2 as v and the engine goes around for another try

    That leaves cvv

  • The only options left are another iteration of ( (v*) X | H? (v*) X )*, or falling out of that loop into c \2

  • The text doesn't start with v, X, or H so the loop ends, and the next match is c \2, and the regex engine matches the c

    Now there is only vv to match

  • perl is now looking for a match to capture 2, which is v. That succeeds

    The remaining string is just v

  • Now perl is looking for $, which is the end of a string, or just before a newline at the end of a string. It sees v and so it fails

I really hope that helps. I'm not in a hurry to explain the remaining four examples and I can't yet see why there is confusion

I haven't experimented with egrep, and I'm surprised that it behaves differently. Maybe it doesn't stack the captures like Perl does?

Please let me know if there's anything of further interest

like image 44
Borodin Avatar answered Oct 16 '22 03:10

Borodin