Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

latest Perl won't match certain regexes more than 32768 characters long

Tags:

regex

perl

I am hoping some Perl gurus can opine on the following. This is the smallest possible example I could find that reproduces my problem:

>./perl -e 'print (("a".("f"x32767)."a") =~ /a(?:[^a]|bb)*a/)'
1

but

>./perl -e 'print (("a".("f"x32768)."a") =~ /a(?:[^a]|bb)*a/)'
>

and I did compile the latest Perl from source, just to see if it would fix the problem:

>./perl -v

This is perl 5, version 20, subversion 1 (v5.20.1) built for i686-linux

Is this a bug (looks like to me)?

like image 523
Mark Galeck Avatar asked Oct 07 '14 00:10

Mark Galeck


2 Answers

This is a known bug reported since 2002 and it has yet to be fixed. You now know you are not the first person to encounter this bug (or feature, as you will see soon).

From this comment in the bug report, it seems that quantifiers (*, +, {n,m}, {n,}) are designed to have an upper limit on the number of repetitions, which prevents the engine from segfault when the stack used for backtracking overflows, but violates the very definition of Kleene operator in regular expression (repeat the pattern for arbitrary number of times) and gives wrong answer for the query1.

1 In contrast, Java's regex engine (Oracle's implemetation) simply allow StackOverflowError to occur for cases like this, but the quantifier has the upper limit of 232 - 1, which is sufficient for most use case. And there exists a workaround for cases like this, which is to use possessive quantifier.

The same comment also print the regex compilation debugging info, and the output clearly shows that * is translated to {0,32767}. It is also reproducible on my machine (perl v5.10.1 (*) built for x86_64-linux-thread-multi).

$ perl -Mre=debug -wce '/(A|B)*/'
Compiling REx "(A|B)*"
Final program:
   1: CURLYM[1] {0,32767} (15)
   5:   TRIE-EXACT[AB] (13)
        <A>
        <B>
  13:   SUCCEED (0)
  14: NOTHING (15)
  15: END (0)
minlen 0
-e syntax OK
Freeing REx: "(A|B)*"

This following test further confirms the problem, and it shows that perl doesn't let you specify a repetition that exceeds the limit.

$ perl -e 'print (("a".("f"x32767)."a") =~ /a(?:[^a]|bb){0,32767}a/)'
Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/a(?:[^a]|bb){ <-- HERE 0,32767}a/ at -e line 1.

Making the quantifier possessive *+ does not solve the problem, since the limit is still there:

$ perl -Mre=debug -wce '/(A|B)*+/'
Compiling REx "(A|B)*+"
Final program:
   1: SUSPEND (19)
   3:   CURLYM[1] {0,32767} (17)
   7:     TRIE-EXACT[AB] (15)
          <A>
          <B>
  15:     SUCCEED (0)
  16:   NOTHING (17)
  17:   SUCCEED (0)
  18: TAIL (19)
  19: END (0)
minlen 0
-e syntax OK
Freeing REx: "(A|B)*+"
like image 98
nhahtdh Avatar answered Nov 18 '22 13:11

nhahtdh


Add use warnings:

use strict;
use warnings;
use feature qw(say);

say "Version is $^V";

say +("a".("f"x32767)."a") =~ /a(?:[^a]|bb)*a/ ? 'matches' : 'no match';

say +("a".("f"x32768)."a") =~ /a(?:[^a]|bb)*a/ ? 'matches' : 'no match';

Outputs:

Complex regular subexpression recursion limit (32766) exceeded at e.pl line 7.
Complex regular subexpression recursion limit (32766) exceeded at e.pl line 9.
Version is v5.20.1
matches
no match

From perldiag

Complex regular subexpression recursion limit (%d) exceeded

(W regexp) The regular expression engine uses recursion in complex situations where back-tracking is required. Recursion depth is limited to 32766, or perhaps less in architectures where the stack cannot grow arbitrarily. ("Simple" and "medium" situations are handled without recursion and are not subject to a limit.) Try shortening the string under examination; looping in Perl code (e.g. with while ) rather than in the regular expression engine; or rewriting the regular expression so that it is simpler or backtracks less. (See perlfaq2 for information on Mastering Regular Expressions.)

like image 11
Miller Avatar answered Nov 18 '22 13:11

Miller