Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is `\s+` so much faster than `\s\s*` in this Perl regex?

Why does replacing \s* (or even \s\s*) with \s+ result in such a speedup for this input?

use Benchmark qw(:all); $x=(" " x 100000) . "_\n"; $count = 100; timethese($count, {     '/\s\s*\n/' => sub { $x =~ /\s\s*\n/ },     '/\s+\n/' => sub { $x =~ /\s+\n/ }, }); 

Link to online version

I noticed a slow regex s/\s*\n\s*/\n/g in my code - when given a 450KB input file consisting of lots of spaces with a few non-spaces here and there, and a final newline at the end - the regex hung and never finished.

I intuitively replaced the regex with s/\s+\n/\n/g; s/\n\s+/\n/g; and all was well.

But why is it so much faster? After using re Debug => "EXECUTE" I noticed the \s+ version is somehow optimised to run in only one iteration: http://pastebin.com/0Ug6xPiQ

Matching REx "\s*\n" against "       _%n" Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d ][{non-utf8-latin1-all}{unicode_all}] against "       _%n" (9 bytes)    0 <> <       _%n>         |  1:STAR(3)                                   SPACE can match 7 times out of 2147483647...                                   failed...    1 < > <      _%n>         |  1:STAR(3)                                   SPACE can match 6 times out of 2147483647...                                   failed...    2 <  > <     _%n>         |  1:STAR(3)                                   SPACE can match 5 times out of 2147483647...                                   failed...    3 <   > <    _%n>         |  1:STAR(3)                                   SPACE can match 4 times out of 2147483647...                                   failed...    4 <    > <   _%n>         |  1:STAR(3)                                   SPACE can match 3 times out of 2147483647...                                   failed...    5 <     > <  _%n>         |  1:STAR(3)                                   SPACE can match 2 times out of 2147483647...                                   failed...    6 <      > < _%n>         |  1:STAR(3)                                   SPACE can match 1 times out of 2147483647...                                   failed...    8 <       _> <%n>         |  1:STAR(3)                                   SPACE can match 1 times out of 2147483647...    8 <       _> <%n>         |  3:  EXACT <\n>(5)    9 <       _%n> <>         |  5:  END(0) Match successful! 
Matching REx "\s+\n" against "       _%n" Matching stclass SPACE against "       _" (8 bytes)    0 <> <       _%n>         |  1:PLUS(3)                                   SPACE can match 7 times out of 2147483647...                                   failed... 

I know Perl 5.10+ will immediately fail the regex (without running it) if a newline is not present. I suspect it is using the location of the newline to reduce the amount of searching it does. For all cases above it seems to cleverly reduce the backtracking involved (usually /\s*\n/ against a string of spaces would take exponential-time). Can anyone offer insight into why the \s+ version is so much faster?

Also note that \s*? does not offer any speedup.

like image 867
rjh Avatar asked Jul 18 '16 08:07

rjh


People also ask

What is S in Perl regex?

To simplify multi-line substitutions, the "." character never matches a newline unless you use the /s modifier, which in effect tells Perl to pretend the string is a single line--even if it isn't.

What is the meaning of $1 in Perl regex?

$1 equals the text " brown ".

What does \s+ mean in Perl?

(\S+) | will match and capture any number (one or more) of non-space characters, followed by a space character (assuming the regular expression isn't modified with a /x flag). In both cases, these constructs appear to be one component of an alternation. Breaking it down: ( .... ) : Group and capture.


2 Answers

First, even if the resulting regex will not keep the same meaning, let's reduces regexes to \s*0 and \s+0 and use (" " x 4) . "_0" as an input. For the sceptics, you can see here that the lag is still present.

Now let's consider the following code:

$x = (" " x 4) . "_ 0"; $x =~ /\s*0/; # The slow line  $x =~ /\s+0/; # The fast line 

Digging a bit with use re debugcolor; we get the following output:

Guessing start of match in sv for REx "\s*0" against "    _0" Found floating substr "0" at offset 5... start_shift: 0 check_at: 5 s: 0 endpos: 6 checked_upto: 0 Does not contradict STCLASS... Guessed: match at offset 0 Matching REx "\s*0" against "    _0" Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against "    _0" (6 bytes)    0 <    _0>|  1:STAR(3)                                   POSIXD[\s] can match 4 times out of 2147483647...                                   failed...    1 <    _0>|  1:STAR(3)                                   POSIXD[\s] can match 3 times out of 2147483647...                                   failed...    2 <    _0>|  1:STAR(3)                                   POSIXD[\s] can match 2 times out of 2147483647...                                   failed...    3 <    _0>|  1:STAR(3)                                   POSIXD[\s] can match 1 times out of 2147483647...                                   failed...    5 <    _0>|  1:STAR(3)                                   POSIXD[\s] can match 0 times out of 2147483647...    5 <    _0>|  3:  EXACT <0>(5)    6 <    _0>|  5:  END(0) Match successful!  -----------------------  Guessing start of match in sv for REx "\s+0" against "    _0" Found floating substr "0" at offset 5... start_shift: 1 check_at: 5 s: 0 endpos: 5 checked_upto: 0 Does not contradict STCLASS... Guessed: match at offset 0 Matching REx "\s+0" against "    _0" Matching stclass POSIXD[\s] against "    _" (5 bytes)    0 <    _0>|  1:PLUS(3)                                   POSIXD[\s] can match 4 times out of 2147483647...                                   failed... Contradicts stclass... [regexec_flags] Match failed 

Perl seems to be optimized for failure. It will first look for constant strings (which only consume O(N)). Here, it'll look for 0 : Found floating substr "0" at offset 5...

Then it'll start with the variable part of the regex, respectivly \s* and \s+, against the whole minimum string to check:

Matching REx "\s*0" against "    _0" Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against "    _0" (6 bytes) Matching REx "\s+0" against "    _0" Matching stclass POSIXD[\s] against "    _" (5 bytes) # Only 5 bytes because there should be at least 1 "\s" char 

After that it'll look for the first position meeting the stclass requirement, here at position 0.

  • \s*0:
    • starts at 0, find 4 spaces then fail;
    • starts at 1, find 3 spaces then fail;
    • starts at 2, find 2 spaces then fail;
    • starts at 3, find 1 spaces then fail;
    • starts at 4, find 0 spaces then doesn't fail;
    • Find an exact 0
  • \s+0:
    • starts at 0, find 4 spaces then fail. As the minimum number of spaces is not matched, the regex fails instantly.

If you want to have fun with Perl regex optimization, you can consider the following regexes / *\n and / * \n. At first glance, they look the same, have the same meaning... But if you run its against (" " x 40000) . "_\n" the first one will check all possibilities while the second one will look for " \n" and fail immediately.

In a vanilla, non-optimized regex engine, both regex can cause catastrophic backtracking, since they need to retry the pattern as it bumps along. However, in the example above, the second doesn't fail with Perl because it have been optimized to find floating substr "0%n"


You can see another example on Jeff Atwood's blog.

Note also that the issue is not about \s consideration but any pattern where xx* is used instead of x+ see example with 0s and also regex explosive quantifiers

With such shorter example the behavior is "findable", but if you start to play with complicated patterns, it's far from easy to spot, for example: Regular expression hangs program (100% CPU usage)

like image 164
Thomas Ayoub Avatar answered Oct 08 '22 19:10

Thomas Ayoub


When there is a "plus" node (e.g. \s+) at the beginning of a pattern and the node fails to match, the regex engine skips forward to the point of failure and tries again; with \s*, on the other hand, the engine only advances one character at a time.

Yves Orton explains this optimization nicely here:

The start class optimisation has two modes, "try every valid start position" (doevery) and "flip flop mode" (!doevery) where it trys only the first valid start position in a sequence.

Consider /(\d+)X/ and the string "123456Y", now we know that if we fail to match X after matching "123456" then we will also fail to match after "23456" (assuming no evil tricks are in place, which disable the optimisation anyway), so we know we can skip forward until the check /fails/ and only then start looking for a real match. This is flip-flop mode.

/\s+/ triggers flip-flop mode; /\s*/, /\s\s*/, and /\s\s+/ don't. This optimization can't be applied to "star" nodes like \s* because they can match zero characters, so a failure at one point in a sequence isn't indicative of failure later in the same sequence.


You can see this in the debug output for each regex. I've highlighted the skipped characters with ^. Compare this (skips four characters at a time):

$ perl -Mre=Debug,MATCH -e'"123 456 789 x" =~ /\d+x/'    ...    0 <> <123 456 78>         |  1:PLUS(3)                                   POSIXD[\d] can match 3 times out of 2147483647...                                   failed...    4 <123 > <456 789 x>      |  1:PLUS(3)       ^^^^                                   POSIXD[\d] can match 3 times out of 2147483647...                                   failed...    8 <23 456 > <789 x>       |  1:PLUS(3)          ^^^^                                   POSIXD[\d] can match 3 times out of 2147483647...                                   failed... 

to this (skips one or two characters at a time):

$ perl -Mre=Debug,MATCH -e'"123 456 789 x" =~ /\d*x/'    ...    0 <> <123 456 78>         |  1:STAR(3)                                   POSIXD[\d] can match 3 times out of 2147483647...                                   failed...    1 <1> <23 456 789>        |  1:STAR(3)       ^                                   POSIXD[\d] can match 2 times out of 2147483647...                                   failed...    2 <12> <3 456 789 >       |  1:STAR(3)        ^                                   POSIXD[\d] can match 1 times out of 2147483647...                                   failed...    4 <123 > <456 789 x>      |  1:STAR(3)         ^^                                   POSIXD[\d] can match 3 times out of 2147483647...                                   failed...    5 <123 4> <56 789 x>      |  1:STAR(3)           ^                                   POSIXD[\d] can match 2 times out of 2147483647...                                   failed...    6 <23 45> <6 789 x>       |  1:STAR(3)           ^                                   POSIXD[\d] can match 1 times out of 2147483647...                                   failed...    8 <23 456 > <789 x>       |  1:STAR(3)            ^^                                   POSIXD[\d] can match 3 times out of 2147483647...                                   failed...    9 <23 456 7> <89 x>       |  1:STAR(3)              ^                                   POSIXD[\d] can match 2 times out of 2147483647...                                   failed...   10 <23 456 78> <9 x>       |  1:STAR(3)               ^                                   POSIXD[\d] can match 1 times out of 2147483647...                                   failed...   12 <23 456 789 > <x>       |  1:STAR(3)                ^^                                   POSIXD[\d] can match 0 times out of 2147483647...   12 <23 456 789 > <x>       |  3:  EXACT <x>(5)   13 <23 456 789 x> <>       |  5:  END(0) 

Note that the optimization isn't applied to /\s\s+/, because \s+ isn't at the beginning of the pattern. Both /\s\s+/ (logically equivalent to /\s{2,}/) and /\s\s*/ (logically equivalent to /\s+/) probably could be optimized, though; it might make sense to ask on perl5-porters whether either would be worth the effort.


In case you're interested, "flip-flop mode" is enabled by setting the PREGf_SKIP flag on a regex when it's compiled. See the code around lines 7344 and 7405 in regcomp.c and line 1585 in regexec.c in the 5.24.0 source.

like image 22
ThisSuitIsBlackNot Avatar answered Oct 08 '22 20:10

ThisSuitIsBlackNot