Why can the regex /[\w\W]+x/i be extremely slow to run?

Tags:

perl

try:

time perl -E '$x="a" x 100000; $x =~ /[\w\W]+x/i'

will run long time (on my notebook 20secs). Without the /i, e.g.

time perl -E '$x="a" x 100000; $x =~ /[\w\W]+x/'

finishes in 0.07 secs.

Regardless of the regex [\w\W] makes not much sense, this enormous difference surprises me.

Why such big difference?

EDIT

To be more precise:

$ time perl -E '$x="a" x 100000; $x =~ /[\w\W]+x/i'  

real    0m19.479s
user    0m19.419s
sys 0m0.038s

my perl

Summary of my perl5 (revision 5 version 20 subversion 3) configuration:

  Platform:
    osname=darwin, osvers=15.0.0, archname=darwin-2level
    uname='darwin nox.local 15.0.0 darwin kernel version 15.0.0: sat sep 19 15:53:46 pdt 2015; root:xnu-3247.10.11~1release_x86_64 x86_64 '
    config_args='-Dprefix=/opt/anyenv/envs/plenv/versions/5.20.3 -de -Dusedevel -A'eval:scriptdir=/opt/anyenv/envs/plenv/versions/5.20.3/bin''
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-common -DPERL_DARWIN -fno-strict-aliasing -pipe -fstack-protector -I/opt/local/include',
    optimize='-O3',
    cppflags='-fno-common -DPERL_DARWIN -fno-strict-aliasing -pipe -fstack-protector -I/opt/local/include'
    ccversion='', gccversion='4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.1.76)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags =' -fstack-protector -L/opt/local/lib'
    libpth=/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../lib/clang/7.0.0/lib /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib /usr/lib /opt/local/lib
    libs=-lpthread -lgdbm -ldbm -ldl -lm -lutil -lc
    perllibs=-lpthread -ldl -lm -lutil -lc
    libc=, so=dylib, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup -L/opt/local/lib -fstack-protector'


Characteristics of this binary (from libperl): 
  Compile-time options: HAS_TIMES PERLIO_LAYERS PERL_DONT_CREATE_GVSV
                        PERL_HASH_FUNC_ONE_AT_A_TIME_HARD PERL_MALLOC_WRAP
                        PERL_NEW_COPY_ON_WRITE PERL_PRESERVE_IVUV
                        PERL_USE_DEVEL USE_64_BIT_ALL USE_64_BIT_INT
                        USE_LARGE_FILES USE_LOCALE USE_LOCALE_COLLATE
                        USE_LOCALE_CTYPE USE_LOCALE_NUMERIC USE_PERLIO
                        USE_PERL_ATOF
  Locally applied patches:
    Devel::PatchPerl 1.38
  Built under darwin
  Compiled at Oct 28 2015 14:46:19
  @INC:
    /opt/anyenv/envs/plenv/versions/5.20.3/lib/perl5/site_perl/5.20.3/darwin-2level
    /opt/anyenv/envs/plenv/versions/5.20.3/lib/perl5/site_perl/5.20.3
    /opt/anyenv/envs/plenv/versions/5.20.3/lib/perl5/5.20.3/darwin-2level
    /opt/anyenv/envs/plenv/versions/5.20.3/lib/perl5/5.20.3
    .

And from the background of the question: the real code matches the string against an large list of regexes (sort of antispam), so i can't reliable manually check the regex DB. The real fragment of code is

sub docheck {
    ...
    ...
    foreach my $regex (@$regexs) {
        if ( $_[0] =~ /$regex/i ) {

and the [\w\W]+ is one of the 10k regexes :(, such: [\w\W]+medicine\.netfirms\.com - The regex-DB probably needs cleaning - but... you know :)

Now the code is changed:

sub docheck {
    ...
    my $str = lc($_[0]);
    foreach my $regex (@$regexs) {
        if ( $str =~ /$regex/ ) {

so avoiding the /i.

291

asked Nov 30 '15 23:11

jm666

2 Answers

TL;DR

In the second case, the optimizer is quite smart and realizes there's no "x" in the string, so there can't be a possible match, and fails earlier. However, for the /i case, it's not so smart to test for both upper and lowercase x, so it goes on and tries to match the whole regex.

Debugging

Although I can't reproduce such big differences in performance, there's an optimization that is triggered for the case-sensitive match.

Let's run it in 'debug' mode:

Code

use re 'debug';
$x="a" x 100000;
$x =~ /[\w\W]+x/;

You could also add -Mre=debug to perl invocation.

Output

Compiling REx "[\w\W]+x"
Final program:
   1: PLUS (13)
   2:   ANYOF[\x{00}-\x{7F}][{non-utf8-latin1-all}{unicode_all}] (0)
  13: EXACT <x> (15)
  15: END (0)
floating "x" at 1..9223372036854775807 (checking floating) stclass ANYOF[\x{00}-\x{7F}][{non-utf8-latin1-all}{unicode_all}] plus minlen 2 
Matching REx "[\w\W]+x" against "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"...
Intuit: trying to determine minimum start position...
  Did not find floating substr "x"...
Match rejected by optimizer
Freeing REx: "[\w\W]+x"

And notice the last part:

Intuit: trying to determine minimum start position...
  Did not find floating substr "x"...
Match rejected by optimizer

The optimizer attempts to find the first occurrence of "x" and, since it can't find it, it rejects the match before the regex engine even attempts it.

Perl regex is optimized to rather fail as early as possible, than to succeed.

Slow case

I can't reproduce the same behaviour on my end (Perl v5.20.2), it fails on a later optimization, probably because of version-specific differences. However, the optimization for the case-insensitive "x" does not occur in this case.

This optimization is not triggered because it has more than 1 possibility for a match in the subject (both the lowercase "x" and the uppercase "X").

Check ThisSuitIsBlackNot's answer if you're interested in this optimization internals.

Now, without the optimization, the regex engine theoretically attempts to match "x" for:

every possible match in [\w\W]+ (consuming the whole string, then backtracking 1 char, and another... and so on), and
every starting position in the subject string (99,999 positions).

Of course, there are other optimizations that reduce this number, but that's why it's so slow. Notice it increases exponentially with longer strings.

Work-around

If you don't particularly require there to be at least one character before x, you should use /.*x/is, since it fails after attempting a match in the first position (the optimizer actually anchors .* to the start of text).
_{* Thanks to @nhahtdh for bringing this up.}

However, for a more general case where this behaviour could arise, one option to increase performance is check for "x" before the rest (either as an independent condition or in the same regex):

$x =~ /(?:^(*COMMIT)(?=.*x))?[\w\W]+x/is;

(?:^ ... )? Checks only once, if at start of string.
(?=.*x) If there's an x ahead
(*COMMIT) Otherwise, it backtracks and COMMIT is a control verb that makes the entire match to fail.

This will run much faster.

answered Oct 05 '22 04:10

Mariano

Mariano is exactly right: the difference in performance is thanks to the optimizer. To find out why the optimizer is triggered in one case but not the other requires a dive into the Perl source code.

A lot of the optimizer code relies on two pieces of data about the pattern:

the longest fixed substring, and
the longest floating substring

This is explained in the comments in regcomp.c in the Perl source:

During optimization we...[get] information about what strings MUST appear in the pattern. We look for the longest string that must appear at a fixed location, and we look for the longest string that may appear at a floating location. So for instance in the pattern:
/FOO[xX]A.*B[xX]BAR/
Both 'FOO' and 'A' are fixed strings. Both 'B' and 'BAR' are floating strings (because they follow a .* construct). study_chunk will identify both FOO and BAR as being the longest fixed and floating strings respectively.

(from Perl 5.22.0)

Why does the optimizer use these substrings? Because it's much easier to fail fast when you can do simple string comparisons and length checks instead of running the full regex engine.

So with /.+x/ we have:

longest fixed substring: none
longest floating substring: x

And with /.+x/i we have:

longest fixed substring: none
longest floating substring: none (case folding means we're no longer working with a simple literal string)

Now, when a pattern is compiled that contains a literal string (either fixed or floating), a special flag, RXf_USE_INTUIT, is set in the compiled regex. When the regex is executed, this flag triggers an optimization routine called re_intuit_start(), which lives in regexec.c:

/* re_intuit_start():
 *
 * Based on some optimiser hints, try to find the earliest position in the
 * string where the regex could match.
 *
 * ...
 *
 * The basic idea of re_intuit_start() is to use some known information
 * about the pattern, namely:
 *
 *   a) the longest known anchored substring (i.e. one that's at a
 *      constant offset from the beginning of the pattern; but not
 *      necessarily at a fixed offset from the beginning of the
 *      string);
 *   b) the longest floating substring (i.e. one that's not at a constant
 *      offset from the beginning of the pattern);
 *   c) Whether the pattern is anchored to the string; either
 *      an absolute anchor: /^../, or anchored to \n: /^.../m,
 *      or anchored to pos(): /\G/;
 *   d) A start class: a real or synthetic character class which
 *      represents which characters are legal at the start of the pattern;
 *
 * to either quickly reject the match, or to find the earliest position
 * within the string at which the pattern might match, thus avoiding
 * running the full NFA engine at those earlier locations, only to
 * eventually fail and retry further along.

With /.+x/, re_intuit_start() is triggered and searches the string being matched for the longest floating substring (x). When it fails to find any xs, the entire match can be rejected immediately.

With /.+x/i, on the other hand, re_intuit_start() is never triggered, so we lose out on our fail-fast optimization.

answered Oct 05 '22 02:10

ThisSuitIsBlackNot

Related questions
                            
                                Is it possible to erase a capture group that has already matched, making it non-participating?
                            
                                Apply list of regex pattern on list python
                            
                                regular expression to remove links [duplicate]
                            
                                Bug in Mathematica: regular expression applied to very long string
                            
                                What are the valid regular expression formats for the nonProxyHosts property in Java?
                            
                                Perl - get first "word" from input string
                            
                                email address hiding some characters with c#, regex
                            
                                Having a dot in the id of rails routes
                            
                                Using Non-Capturing Groups in MySQL REGEXP
                            
                                Why do we need to escape the ! < > : = - in php regular expressions?
                            
                                vba positive lookahead is too greedy
                            
                                What's a prefix regular expression?
                            
                                denying special characters jQuery Validate
                            
                                How do I use Java Regex to find all repeating character sequences in a string?
                            
                                Regex checking repeat
                            
                                Pattern Matching Java Regex - "One-or-more" operator
                            
                                Extract x,y coordinates from arbitrary SVG path with javascript
                            
                                Escaping close-bracket character in grep complement
                            
                                Exclude paths in routing
                            
                                Convert markdown links from inline to reference

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With