Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why can the regex /[\w\W]+x/i be extremely slow to run?

Tags:

regex

perl

try:

time perl -E '$x="a" x 100000; $x =~ /[\w\W]+x/i'  

will run long time (on my notebook 20secs). Without the /i, e.g.

time perl -E '$x="a" x 100000; $x =~ /[\w\W]+x/' 

finishes in 0.07 secs.

Regardless of the regex [\w\W] makes not much sense, this enormous difference surprises me.

Why such big difference?

EDIT

To be more precise:

$ time perl -E '$x="a" x 100000; $x =~ /[\w\W]+x/i'  

real    0m19.479s
user    0m19.419s
sys 0m0.038s

my perl

Summary of my perl5 (revision 5 version 20 subversion 3) configuration:

  Platform:
    osname=darwin, osvers=15.0.0, archname=darwin-2level
    uname='darwin nox.local 15.0.0 darwin kernel version 15.0.0: sat sep 19 15:53:46 pdt 2015; root:xnu-3247.10.11~1release_x86_64 x86_64 '
    config_args='-Dprefix=/opt/anyenv/envs/plenv/versions/5.20.3 -de -Dusedevel -A'eval:scriptdir=/opt/anyenv/envs/plenv/versions/5.20.3/bin''
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-common -DPERL_DARWIN -fno-strict-aliasing -pipe -fstack-protector -I/opt/local/include',
    optimize='-O3',
    cppflags='-fno-common -DPERL_DARWIN -fno-strict-aliasing -pipe -fstack-protector -I/opt/local/include'
    ccversion='', gccversion='4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.1.76)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags =' -fstack-protector -L/opt/local/lib'
    libpth=/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../lib/clang/7.0.0/lib /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib /usr/lib /opt/local/lib
    libs=-lpthread -lgdbm -ldbm -ldl -lm -lutil -lc
    perllibs=-lpthread -ldl -lm -lutil -lc
    libc=, so=dylib, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup -L/opt/local/lib -fstack-protector'


Characteristics of this binary (from libperl): 
  Compile-time options: HAS_TIMES PERLIO_LAYERS PERL_DONT_CREATE_GVSV
                        PERL_HASH_FUNC_ONE_AT_A_TIME_HARD PERL_MALLOC_WRAP
                        PERL_NEW_COPY_ON_WRITE PERL_PRESERVE_IVUV
                        PERL_USE_DEVEL USE_64_BIT_ALL USE_64_BIT_INT
                        USE_LARGE_FILES USE_LOCALE USE_LOCALE_COLLATE
                        USE_LOCALE_CTYPE USE_LOCALE_NUMERIC USE_PERLIO
                        USE_PERL_ATOF
  Locally applied patches:
    Devel::PatchPerl 1.38
  Built under darwin
  Compiled at Oct 28 2015 14:46:19
  @INC:
    /opt/anyenv/envs/plenv/versions/5.20.3/lib/perl5/site_perl/5.20.3/darwin-2level
    /opt/anyenv/envs/plenv/versions/5.20.3/lib/perl5/site_perl/5.20.3
    /opt/anyenv/envs/plenv/versions/5.20.3/lib/perl5/5.20.3/darwin-2level
    /opt/anyenv/envs/plenv/versions/5.20.3/lib/perl5/5.20.3
    .

And from the background of the question: the real code matches the string against an large list of regexes (sort of antispam), so i can't reliable manually check the regex DB. The real fragment of code is

sub docheck {
    ...
    ...
    foreach my $regex (@$regexs) {
        if ( $_[0] =~ /$regex/i ) {

and the [\w\W]+ is one of the 10k regexes :(, such: [\w\W]+medicine\.netfirms\.com - The regex-DB probably needs cleaning - but... you know :)

Now the code is changed:

sub docheck {
    ...
    my $str = lc($_[0]);
    foreach my $regex (@$regexs) {
        if ( $str =~ /$regex/ ) {

so avoiding the /i.

like image 291
jm666 Avatar asked Nov 30 '15 23:11

jm666


People also ask

Why is regex so slow?

The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ". *" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.

How do I make regular expressions faster?

Expose Literal Characters Regex engines match fastest when anchors and literal characters are right there in the main pattern, rather than buried in sub-expressions. Hence the advice to "expose" literal characters whenever you can take them out of an alternation or quantified expression. Let's look at two examples.

How fast is regex matching?

Performance of the Best Regex When I reran the test, the best regex took about the same amount of time to match the non-matching input, but the matching input took only on average 800 milliseconds to run, as opposed to 4,700 milliseconds for the better regex and a whopping 17,000 milliseconds for the bad regex.

What does regex 0 * 1 * 0 * 1 * Mean?

Basically (0+1)* mathes any sequence of ones and zeroes. So, in your example (0+1)*1(0+1)* should match any sequence that has 1. It would not match 000 , but it would match 010 , 1 , 111 etc. (0+1) means 0 OR 1.


2 Answers

TL;DR

In the second case, the optimizer is quite smart and realizes there's no "x" in the string, so there can't be a possible match, and fails earlier. However, for the /i case, it's not so smart to test for both upper and lowercase x, so it goes on and tries to match the whole regex.


Debugging

Although I can't reproduce such big differences in performance, there's an optimization that is triggered for the case-sensitive match.

Let's run it in 'debug' mode:

Code

use re 'debug';
$x="a" x 100000;
$x =~ /[\w\W]+x/;
  • You could also add -Mre=debug to perl invocation.

Output

Compiling REx "[\w\W]+x"
Final program:
   1: PLUS (13)
   2:   ANYOF[\x{00}-\x{7F}][{non-utf8-latin1-all}{unicode_all}] (0)
  13: EXACT <x> (15)
  15: END (0)
floating "x" at 1..9223372036854775807 (checking floating) stclass ANYOF[\x{00}-\x{7F}][{non-utf8-latin1-all}{unicode_all}] plus minlen 2 
Matching REx "[\w\W]+x" against "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"...
Intuit: trying to determine minimum start position...
  Did not find floating substr "x"...
Match rejected by optimizer
Freeing REx: "[\w\W]+x"

And notice the last part:

Intuit: trying to determine minimum start position...
  Did not find floating substr "x"...
Match rejected by optimizer

The optimizer attempts to find the first occurrence of "x" and, since it can't find it, it rejects the match before the regex engine even attempts it.

Perl regex is optimized to rather fail as early as possible, than to succeed.


Slow case

I can't reproduce the same behaviour on my end (Perl v5.20.2), it fails on a later optimization, probably because of version-specific differences. However, the optimization for the case-insensitive "x" does not occur in this case.

This optimization is not triggered because it has more than 1 possibility for a match in the subject (both the lowercase "x" and the uppercase "X").

  • Check ThisSuitIsBlackNot's answer if you're interested in this optimization internals.

Now, without the optimization, the regex engine theoretically attempts to match "x" for:

  • every possible match in [\w\W]+ (consuming the whole string, then backtracking 1 char, and another... and so on), and
  • every starting position in the subject string (99,999 positions).

Of course, there are other optimizations that reduce this number, but that's why it's so slow. Notice it increases exponentially with longer strings.

Work-around

If you don't particularly require there to be at least one character before x, you should use /.*x/is, since it fails after attempting a match in the first position (the optimizer actually anchors .* to the start of text).
* Thanks to @nhahtdh for bringing this up.

However, for a more general case where this behaviour could arise, one option to increase performance is check for "x" before the rest (either as an independent condition or in the same regex):

$x =~ /(?:^(*COMMIT)(?=.*x))?[\w\W]+x/is;
  • (?:^ ... )? Checks only once, if at start of string.
  • (?=.*x) If there's an x ahead
  • (*COMMIT) Otherwise, it backtracks and COMMIT is a control verb that makes the entire match to fail.

This will run much faster.

like image 51
Mariano Avatar answered Oct 05 '22 04:10

Mariano


Mariano is exactly right: the difference in performance is thanks to the optimizer. To find out why the optimizer is triggered in one case but not the other requires a dive into the Perl source code.

A lot of the optimizer code relies on two pieces of data about the pattern:

  • the longest fixed substring, and

  • the longest floating substring

This is explained in the comments in regcomp.c in the Perl source:

During optimization we...[get] information about what strings MUST appear in the pattern. We look for the longest string that must appear at a fixed location, and we look for the longest string that may appear at a floating location. So for instance in the pattern:

/FOO[xX]A.*B[xX]BAR/

Both 'FOO' and 'A' are fixed strings. Both 'B' and 'BAR' are floating strings (because they follow a .* construct). study_chunk will identify both FOO and BAR as being the longest fixed and floating strings respectively.

(from Perl 5.22.0)

Why does the optimizer use these substrings? Because it's much easier to fail fast when you can do simple string comparisons and length checks instead of running the full regex engine.

So with /.+x/ we have:

  • longest fixed substring: none
  • longest floating substring: x

And with /.+x/i we have:

  • longest fixed substring: none
  • longest floating substring: none (case folding means we're no longer working with a simple literal string)

Now, when a pattern is compiled that contains a literal string (either fixed or floating), a special flag, RXf_USE_INTUIT, is set in the compiled regex. When the regex is executed, this flag triggers an optimization routine called re_intuit_start(), which lives in regexec.c:

/* re_intuit_start():
 *
 * Based on some optimiser hints, try to find the earliest position in the
 * string where the regex could match.
 *
 * ...
 *
 * The basic idea of re_intuit_start() is to use some known information
 * about the pattern, namely:
 *
 *   a) the longest known anchored substring (i.e. one that's at a
 *      constant offset from the beginning of the pattern; but not
 *      necessarily at a fixed offset from the beginning of the
 *      string);
 *   b) the longest floating substring (i.e. one that's not at a constant
 *      offset from the beginning of the pattern);
 *   c) Whether the pattern is anchored to the string; either
 *      an absolute anchor: /^../, or anchored to \n: /^.../m,
 *      or anchored to pos(): /\G/;
 *   d) A start class: a real or synthetic character class which
 *      represents which characters are legal at the start of the pattern;
 *
 * to either quickly reject the match, or to find the earliest position
 * within the string at which the pattern might match, thus avoiding
 * running the full NFA engine at those earlier locations, only to
 * eventually fail and retry further along.

With /.+x/, re_intuit_start() is triggered and searches the string being matched for the longest floating substring (x). When it fails to find any xs, the entire match can be rejected immediately.

With /.+x/i, on the other hand, re_intuit_start() is never triggered, so we lose out on our fail-fast optimization.

like image 35
ThisSuitIsBlackNot Avatar answered Oct 05 '22 02:10

ThisSuitIsBlackNot