Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

perlre length limit

Tags:

regex

perl

From man perlre:

The "*" quantifier is equivalent to "{0,}", the "+" quantifier to "{1,}", and the "?" quantifier to "{0,1}". n and m are limited to integral values less than a preset limit defined when perl is built. This is usually 32766 on the most common platforms. The actual limit can be seen in the error message generated by code such as this:

       $_ **= $_ , / {$_} / for 2 .. 42;

Ay that's ugly - Isn't there some constant I can get instead?

Edit: As daxim pointed out (and perlretut hints towards) it might be that 32767 is a magical hardcoded number. A little searching in the Perl code goes a long way, but I'm not sure how to get to the next step and actually find out where the default reg_infty or REG_INFTY is actually set:

~/dev/perl-5.12.2
$ grep -ri 'reg_infty.*=' *
regexec.c:      if (max != REG_INFTY && ST.count == max)
t/re/pat.t:        $::reg_infty   = $Config {reg_infty} // 32767;
t/re/pat.t:        $::reg_infty_m = $::reg_infty - 1;
t/re/pat.t:        $::reg_infty_p = $::reg_infty + 1;
t/re/pat.t:        $::reg_infty_m = $::reg_infty_m;   # Surpress warning.

Edit 2: DVK is of course right: It's defined at compile time, and can probably be overridden only with REG_INFTY.

like image 762
l0b0 Avatar asked Jan 04 '11 10:01

l0b0


People also ask

What is $1 Perl?

$1 equals the text " brown ".

What is \W in Perl regex?

A \w matches a single alphanumeric character (an alphabetic character, or a decimal digit) or _ , not a whole word. Use \w+ to match a string of Perl-identifier characters (which isn't the same as matching an English word).

What does \s mean in Perl?

In addition, Perl defines the following: \w Match a "word" character (alphanumeric plus "_") \W Match a non-word character \s Match a whitespace character \S Match a non-whitespace character \d Match a digit character \D Match a non-digit character.


1 Answers

Summary: there are 3 ways I can think of to find the limit: empirical, "matching Perl tests" and "theoretical".

  • Empirical:

    eval {$_ **= $_ , / {$_} / for 2 .. 129};
    # To be truly portable, the above should ideally loop forever till $@ is true.
    $@ =~ /bigger than (-?\d+) /; 
    print "LIMIT: $1\n"'
    

    This seems obvious enough that it doesn't require explanation.

  • Matches Perl tests:

    Perl has a series of tests for regex, some of which (in pat.t) deal with testing this max value. So, you can approximate that the max value computed in those tests is "good enough" and follow the test's logic:

    use Config;
    $reg_infty = $Config {reg_infty} // 2 ** 15 - 1; # 32767
    print "Test-based reg_infinity limit: $reg_infty\n";
    

    The explanation of where in the tests this is based off of is in below details.

  • Theoretical: This is attempting to replicate the EXACT logic used by C code to generate this value.

    This is harder that it sounds, because it's affected by 2 things: Perl build configuration and a bunch of C #define statements with branching logic. I was able to delve fairly deeply into that logic, but was stalled on two problems: the #ifdefs reference a bunch of tokens that are NOT actually defined anywhere in Perl code that I can find - and I don't know how to find out from within Perl what those defines values were, and the ultimate default value (assuming I'm right and those #ifdefs always end up with the default) of #define PERL_USHORT_MAX ((unsigned short)~(unsigned)0) (The actual limit is gotten by removing 1 bit off that resulting all-ones number - details below).

    I'm also not sure how to access the amount of bytes in short from Perl for whichever implementation was used to build perl executable.

    So, even if the answer to both those questions can be found (which I'm not sure of), the resulting logic would most certainly be "uglier" and more complex than the straightforward "empirical eval-based" one I offered as the first option.

Below I will provide the details of where various bits and pieces of logic related to to this limit live in Perl code, as well as my attempts to arrive at "Theoretically correct" solution matching C logic.


OK, here is some investigation part way, you can complete it yourself as I have ti run or I will complete later:

  • From regcomp.c: vFAIL2("Quantifier in {,} bigger than %d", REG_INFTY - 1);

    So, the limit is obviously taken from REG_INFTY define. Which is declared in:

  • rehcomp.h:

     /* XXX fix this description.
        Impose a limit of REG_INFTY on various pattern matching operations
        to limit stack growth and to avoid "infinite" recursions.
     */
     /* The default size for REG_INFTY is I16_MAX, which is the same as
        SHORT_MAX (see perl.h).  Unfortunately I16 isn't necessarily 16 bits
        (see handy.h).  On the Cray C90, sizeof(short)==4 and hence I16_MAX is
        ((1<<31)-1), while on the Cray T90, sizeof(short)==8 and I16_MAX is
        ((1<<63)-1).  To limit stack growth to reasonable sizes, supply a
        smaller default.
             --Andy Dougherty  11 June 1998
     */
     #if SHORTSIZE > 2
     #  ifndef REG_INFTY
     #    define REG_INFTY ((1<<15)-1)
     #  endif
     #endif
     #ifndef REG_INFTY
     #  define REG_INFTY I16_MAX
     #endif
    

    Please note that SHORTSIZE is overridable via Config - I will leave details of that out but the logic will need to include $Config{shortsize} :)

  • From handy.h (this doesn't seem to be part of Perl source at first glance so it looks like an iffy step):

     #if defined(UINT8_MAX) && defined(INT16_MAX) && defined(INT32_MAX)
     #define I16_MAX INT16_MAX
     #else
     #define I16_MAX PERL_SHORT_MAX
    
  • I could not find ANY place which defined INT16_MAX at all :(

    Someone help please!!!

  • PERL_SHORT_MAX is defined in perl.h:

     #ifdef SHORT_MAX
     #  define PERL_SHORT_MAX ((short)SHORT_MAX)
     #else
     #  ifdef MAXSHORT    /* Often used in <values.h> */
     #    define PERL_SHORT_MAX ((short)MAXSHORT)
     #  else
     #    ifdef SHRT_MAX
     #      define PERL_SHORT_MAX ((short)SHRT_MAX)
     #    else
     #      define PERL_SHORT_MAX      ((short) (PERL_USHORT_MAX >> 1))
     #    endif
     #  endif
     #endif
    

    I wasn't able to find any place which defined SHORT_MAX, MAXSHORT or SHRT_MAX so far. So the default of ((short) (PERL_USHORT_MAX >> 1)) it is assumed to be for now :)

  • PERL_USHORT_MAX is defined very similarly in perl.h, and again I couldn't find a trace of definition of USHORT_MAX/MAXUSHORT/USHRT_MAX.

    Which seems to imply that it's set by default to: #define PERL_USHORT_MAX ((unsigned short)~(unsigned)0). How to extract that value from Perl side, I have no clue - it's basically a number you get by bitwise negating a short 0, so if unsigned short is 16 bytes, then PERL_USHORT_MAX will be 16 ones, and PERL_SHORT_MAX will be 15 ones, e.g. 2^15-1, e.g. 32767.

  • Also, from t/re/pat.t (regex tests): $::reg_infty = $Config {reg_infty} // 32767; (to illustrate where the non-default compiled in value is stored).

So, to get your constant, you do:

use Config;
my $shortsize = $Config{shortsize} // 2;
$c_reg_infty = (defined $Config {reg_infty}) ? $Config {reg_infty}
                                             : ($shortsize > 2) ? 2**16-1
                                             : get_PERL_SHORT_MAX();
# Where get_PERL_SHORT_MAX() depends on logic for PERL_SHORT_MAX in perl.h
# which I'm not sure how to extract into Perl with any precision
# due to a bunch of never-seen "#define"s and unknown size of "short".
# You can probably do fairly well by simply returning 2**8-1 if shortsize==1 
# and 2^^16-1 otherwise.
say "REAL reg_infinity based on C headers: $c_reg_infty";
like image 116
DVK Avatar answered Oct 19 '22 04:10

DVK