Why does replacing <code>\s*</code> (or even <code>\s\s*</code>) with <code>\s+</code> result in such a speedup for this input? <pre class="prettyprint"><code>use Benchmark qw(:all); $x=(" " x 100000) . "_\n"; $count = 100; timethese($count, { '/\s\s*\n/' => sub { $x =~ /\s\s*\n/ }, '/\s+\n/' => sub { $x =~ /\s+\n/ }, }); </code></pre> Link to online version I noticed a slow regex <code>s/\s*\n\s*/\n/g</code> in my code - when given a 450KB input file consisting of lots of spaces with a few non-spaces here and there, and a final newline at the end - the regex hung and never finished. I intuitively replaced the regex with <code>s/\s+\n/\n/g; s/\n\s+/\n/g;</code> and all was well. But why is it so much faster? After using <code>re Debug => "EXECUTE"</code> I noticed the <code>\s+</code> version is somehow optimised to run in only one iteration: http://pastebin.com/0Ug6xPiQ <pre class="prettyprint lang-none prettyprint-override"><code>Matching REx "\s*\n" against " _%n" Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d ][{non-utf8-latin1-all}{unicode_all}] against " _%n" (9 bytes) 0 <> < _%n> | 1:STAR(3) SPACE can match 7 times out of 2147483647... failed... 1 < > < _%n> | 1:STAR(3) SPACE can match 6 times out of 2147483647... failed... 2 < > < _%n> | 1:STAR(3) SPACE can match 5 times out of 2147483647... failed... 3 < > < _%n> | 1:STAR(3) SPACE can match 4 times out of 2147483647... failed... 4 < > < _%n> | 1:STAR(3) SPACE can match 3 times out of 2147483647... failed... 5 < > < _%n> | 1:STAR(3) SPACE can match 2 times out of 2147483647... failed... 6 < > < _%n> | 1:STAR(3) SPACE can match 1 times out of 2147483647... failed... 8 < _> <%n> | 1:STAR(3) SPACE can match 1 times out of 2147483647... 8 < _> <%n> | 3: EXACT <\n>(5) 9 < _%n> <> | 5: END(0) Match successful! </code></pre> <pre class="prettyprint lang-none prettyprint-override"><code>Matching REx "\s+\n" against " _%n" Matching stclass SPACE against " _" (8 bytes) 0 <> < _%n> | 1:PLUS(3) SPACE can match 7 times out of 2147483647... failed... </code></pre> I know Perl 5.10+ will immediately fail the regex (without running it) if a newline is not present. I suspect it is using the location of the newline to reduce the amount of searching it does. For all cases above it seems to cleverly reduce the backtracking involved (usually <code>/\s*\n/</code> against a string of spaces would take exponential-time). Can anyone offer insight into why the <code>\s+</code> version is so much faster? Also note that <code>\s*?</code> does not offer any speedup.

First, even if the resulting regex will not keep the same meaning, let's reduces regexes to <code>\s*0</code> and <code>\s+0</code> and use <code>(" " x 4) . "_0"</code> as an input. For the sceptics, you can see here that the lag is still present. Now let's consider the following code: <pre class="prettyprint"><code>$x = (" " x 4) . "_ 0"; $x =~ /\s*0/; # The slow line $x =~ /\s+0/; # The fast line </code></pre> Digging a bit with <code>use re debugcolor;</code> we get the following output: <pre class="prettyprint lang-none prettyprint-override"><code>Guessing start of match in sv for REx "\s*0" against " _0" Found floating substr "0" at offset 5... start_shift: 0 check_at: 5 s: 0 endpos: 6 checked_upto: 0 Does not contradict STCLASS... Guessed: match at offset 0 Matching REx "\s*0" against " _0" Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against " _0" (6 bytes) 0 < _0>| 1:STAR(3) POSIXD[\s] can match 4 times out of 2147483647... failed... 1 < _0>| 1:STAR(3) POSIXD[\s] can match 3 times out of 2147483647... failed... 2 < _0>| 1:STAR(3) POSIXD[\s] can match 2 times out of 2147483647... failed... 3 < _0>| 1:STAR(3) POSIXD[\s] can match 1 times out of 2147483647... failed... 5 < _0>| 1:STAR(3) POSIXD[\s] can match 0 times out of 2147483647... 5 < _0>| 3: EXACT <0>(5) 6 < _0>| 5: END(0) Match successful! ----------------------- Guessing start of match in sv for REx "\s+0" against " _0" Found floating substr "0" at offset 5... start_shift: 1 check_at: 5 s: 0 endpos: 5 checked_upto: 0 Does not contradict STCLASS... Guessed: match at offset 0 Matching REx "\s+0" against " _0" Matching stclass POSIXD[\s] against " _" (5 bytes) 0 < _0>| 1:PLUS(3) POSIXD[\s] can match 4 times out of 2147483647... failed... Contradicts stclass... [regexec_flags] Match failed </code></pre> Perl seems to be optimized for failure. It will first look for constant strings (which only consume O(N)). Here, it'll look for <code>0</code> : <code>Found floating substr "0" at offset 5...</code> Then it'll start with the variable part of the regex, respectivly <code>\s*</code> and <code>\s+</code>, against the whole minimum string to check: <pre class="prettyprint lang-none prettyprint-override"><code>Matching REx "\s*0" against " _0" Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against " _0" (6 bytes) Matching REx "\s+0" against " _0" Matching stclass POSIXD[\s] against " _" (5 bytes) # Only 5 bytes because there should be at least 1 "\s" char </code></pre> After that it'll look for the first position meeting the <code>stclass</code> requirement, here at position 0. <ul> <li> <code>\s*0</code>: <ul> <li>starts at 0, find 4 spaces then fail;</li> <li>starts at 1, find 3 spaces then fail;</li> <li>starts at 2, find 2 spaces then fail;</li> <li>starts at 3, find 1 spaces then fail;</li> <li>starts at 4, find 0 spaces then doesn't fail;</li> <li>Find an exact <code>0</code> </li> </ul> </li> <li> <code>\s+0</code>: <ul> <li>starts at 0, find 4 spaces then fail. As the minimum number of spaces is not matched, the regex fails instantly.</li> </ul> </li> </ul> If you want to have fun with Perl regex optimization, you can consider the following regexes <code>/ *\n</code> and <code>/ * \n</code>. At first glance, they look the same, have the same meaning... But if you run its against <code>(" " x 40000) . "_\n"</code> the first one will check all possibilities while the second one will look for <code>" \n"</code> and fail immediately. In a vanilla, non-optimized regex engine, both regex can cause catastrophic backtracking, since they need to retry the pattern as it bumps along. However, in the example above, the second doesn't fail with Perl because it have been optimized to <code>find floating substr "0%n"</code> <hr> You can see another example on Jeff Atwood's blog. Note also that the issue is not about <code>\s</code> consideration but any pattern where <code>xx*</code> is used instead of <code>x+</code> see example with 0s and also regex explosive quantifiers With such shorter example the behavior is "findable", but if you start to play with complicated patterns, it's far from easy to spot, for example: Regular expression hangs program (100% CPU usage)

When there is a "plus" node (e.g. <code>\s+</code>) at the beginning of a pattern and the node fails to match, the regex engine skips forward to the point of failure and tries again; with <code>\s*</code>, on the other hand, the engine only advances one character at a time. Yves Orton explains this optimization nicely here: <blockquote> The start class optimisation has two modes, "try every valid start position" (doevery) and "flip flop mode" (!doevery) where it trys only the first valid start position in a sequence. Consider /(\d+)X/ and the string "123456Y", now we know that if we fail to match X after matching "123456" then we will also fail to match after "23456" (assuming no evil tricks are in place, which disable the optimisation anyway), so we know we can skip forward until the check /fails/ and only then start looking for a real match. This is flip-flop mode. </blockquote> <code>/\s+/</code> triggers flip-flop mode; <code>/\s*/</code>, <code>/\s\s*/</code>, and <code>/\s\s+/</code> don't. This optimization can't be applied to "star" nodes like <code>\s*</code> because they can match zero characters, so a failure at one point in a sequence isn't indicative of failure later in the same sequence. <hr> You can see this in the debug output for each regex. I've highlighted the skipped characters with <code>^</code>. Compare this (skips four characters at a time): <pre class="prettyprint lang-none prettyprint-override"><code>$ perl -Mre=Debug,MATCH -e'"123 456 789 x" =~ /\d+x/' ... 0 <> <123 456 78> | 1:PLUS(3) POSIXD[\d] can match 3 times out of 2147483647... failed... 4 <123 > <456 789 x> | 1:PLUS(3) ^^^^ POSIXD[\d] can match 3 times out of 2147483647... failed... 8 <23 456 > <789 x> | 1:PLUS(3) ^^^^ POSIXD[\d] can match 3 times out of 2147483647... failed... </code></pre> to this (skips one or two characters at a time): <pre class="prettyprint lang-none prettyprint-override"><code>$ perl -Mre=Debug,MATCH -e'"123 456 789 x" =~ /\d*x/' ... 0 <> <123 456 78> | 1:STAR(3) POSIXD[\d] can match 3 times out of 2147483647... failed... 1 <1> <23 456 789> | 1:STAR(3) ^ POSIXD[\d] can match 2 times out of 2147483647... failed... 2 <12> <3 456 789 > | 1:STAR(3) ^ POSIXD[\d] can match 1 times out of 2147483647... failed... 4 <123 > <456 789 x> | 1:STAR(3) ^^ POSIXD[\d] can match 3 times out of 2147483647... failed... 5 <123 4> <56 789 x> | 1:STAR(3) ^ POSIXD[\d] can match 2 times out of 2147483647... failed... 6 <23 45> <6 789 x> | 1:STAR(3) ^ POSIXD[\d] can match 1 times out of 2147483647... failed... 8 <23 456 > <789 x> | 1:STAR(3) ^^ POSIXD[\d] can match 3 times out of 2147483647... failed... 9 <23 456 7> <89 x> | 1:STAR(3) ^ POSIXD[\d] can match 2 times out of 2147483647... failed... 10 <23 456 78> <9 x> | 1:STAR(3) ^ POSIXD[\d] can match 1 times out of 2147483647... failed... 12 <23 456 789 > <x> | 1:STAR(3) ^^ POSIXD[\d] can match 0 times out of 2147483647... 12 <23 456 789 > <x> | 3: EXACT <x>(5) 13 <23 456 789 x> <> | 5: END(0) </code></pre> Note that the optimization isn't applied to <code>/\s\s+/</code>, because <code>\s+</code> isn't at the beginning of the pattern. Both <code>/\s\s+/</code> (logically equivalent to <code>/\s{2,}/</code>) and <code>/\s\s*/</code> (logically equivalent to <code>/\s+/</code>) probably could be optimized, though; it might make sense to ask on perl5-porters whether either would be worth the effort. <hr> In case you're interested, "flip-flop mode" is enabled by setting the <code>PREGf_SKIP</code> flag on a regex when it's compiled. See the code around lines 7344 and 7405 in regcomp.c and line 1585 in regexec.c in the 5.24.0 source.

Why is `\s+` so much faster than `\s\s*` in this Perl regex?

Tags:

regex

perl

regex-greedy

Why does replacing \s* (or even \s\s*) with \s+ result in such a speedup for this input?

use Benchmark qw(:all); $x=(" " x 100000) . "_\n"; $count = 100; timethese($count, {     '/\s\s*\n/' => sub { $x =~ /\s\s*\n/ },     '/\s+\n/' => sub { $x =~ /\s+\n/ }, });

Link to online version

I noticed a slow regex s/\s*\n\s*/\n/g in my code - when given a 450KB input file consisting of lots of spaces with a few non-spaces here and there, and a final newline at the end - the regex hung and never finished.

I intuitively replaced the regex with s/\s+\n/\n/g; s/\n\s+/\n/g; and all was well.

But why is it so much faster? After using re Debug => "EXECUTE" I noticed the \s+ version is somehow optimised to run in only one iteration: http://pastebin.com/0Ug6xPiQ

Matching REx "\s*\n" against "       _%n" Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d ][{non-utf8-latin1-all}{unicode_all}] against "       _%n" (9 bytes)    0 <> <       _%n>         |  1:STAR(3)                                   SPACE can match 7 times out of 2147483647...                                   failed...    1 < > <      _%n>         |  1:STAR(3)                                   SPACE can match 6 times out of 2147483647...                                   failed...    2 <  > <     _%n>         |  1:STAR(3)                                   SPACE can match 5 times out of 2147483647...                                   failed...    3 <   > <    _%n>         |  1:STAR(3)                                   SPACE can match 4 times out of 2147483647...                                   failed...    4 <    > <   _%n>         |  1:STAR(3)                                   SPACE can match 3 times out of 2147483647...                                   failed...    5 <     > <  _%n>         |  1:STAR(3)                                   SPACE can match 2 times out of 2147483647...                                   failed...    6 <      > < _%n>         |  1:STAR(3)                                   SPACE can match 1 times out of 2147483647...                                   failed...    8 <       _> <%n>         |  1:STAR(3)                                   SPACE can match 1 times out of 2147483647...    8 <       _> <%n>         |  3:  EXACT <\n>(5)    9 <       _%n> <>         |  5:  END(0) Match successful!

Matching REx "\s+\n" against "       _%n" Matching stclass SPACE against "       _" (8 bytes)    0 <> <       _%n>         |  1:PLUS(3)                                   SPACE can match 7 times out of 2147483647...                                   failed...

I know Perl 5.10+ will immediately fail the regex (without running it) if a newline is not present. I suspect it is using the location of the newline to reduce the amount of searching it does. For all cases above it seems to cleverly reduce the backtracking involved (usually /\s*\n/ against a string of spaces would take exponential-time). Can anyone offer insight into why the \s+ version is so much faster?

Also note that \s*? does not offer any speedup.

867

asked Jul 18 '16 08:07

rjh

2 Answers

First, even if the resulting regex will not keep the same meaning, let's reduces regexes to \s*0 and \s+0 and use (" " x 4) . "_0" as an input. For the sceptics, you can see here that the lag is still present.

Now let's consider the following code:

$x = (" " x 4) . "_ 0"; $x =~ /\s*0/; # The slow line  $x =~ /\s+0/; # The fast line

Digging a bit with use re debugcolor; we get the following output:

Guessing start of match in sv for REx "\s*0" against "    _0" Found floating substr "0" at offset 5... start_shift: 0 check_at: 5 s: 0 endpos: 6 checked_upto: 0 Does not contradict STCLASS... Guessed: match at offset 0 Matching REx "\s*0" against "    _0" Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against "    _0" (6 bytes)    0 <    _0>|  1:STAR(3)                                   POSIXD[\s] can match 4 times out of 2147483647...                                   failed...    1 <    _0>|  1:STAR(3)                                   POSIXD[\s] can match 3 times out of 2147483647...                                   failed...    2 <    _0>|  1:STAR(3)                                   POSIXD[\s] can match 2 times out of 2147483647...                                   failed...    3 <    _0>|  1:STAR(3)                                   POSIXD[\s] can match 1 times out of 2147483647...                                   failed...    5 <    _0>|  1:STAR(3)                                   POSIXD[\s] can match 0 times out of 2147483647...    5 <    _0>|  3:  EXACT <0>(5)    6 <    _0>|  5:  END(0) Match successful!  -----------------------  Guessing start of match in sv for REx "\s+0" against "    _0" Found floating substr "0" at offset 5... start_shift: 1 check_at: 5 s: 0 endpos: 5 checked_upto: 0 Does not contradict STCLASS... Guessed: match at offset 0 Matching REx "\s+0" against "    _0" Matching stclass POSIXD[\s] against "    _" (5 bytes)    0 <    _0>|  1:PLUS(3)                                   POSIXD[\s] can match 4 times out of 2147483647...                                   failed... Contradicts stclass... [regexec_flags] Match failed

Perl seems to be optimized for failure. It will first look for constant strings (which only consume O(N)). Here, it'll look for 0 : Found floating substr "0" at offset 5...

Then it'll start with the variable part of the regex, respectivly \s* and \s+, against the whole minimum string to check:

Matching REx "\s*0" against "    _0" Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against "    _0" (6 bytes) Matching REx "\s+0" against "    _0" Matching stclass POSIXD[\s] against "    _" (5 bytes) # Only 5 bytes because there should be at least 1 "\s" char

After that it'll look for the first position meeting the stclass requirement, here at position 0.

\s*0:
- starts at 0, find 4 spaces then fail;
- starts at 1, find 3 spaces then fail;
- starts at 2, find 2 spaces then fail;
- starts at 3, find 1 spaces then fail;
- starts at 4, find 0 spaces then doesn't fail;
- Find an exact 0
\s+0:
- starts at 0, find 4 spaces then fail. As the minimum number of spaces is not matched, the regex fails instantly.

If you want to have fun with Perl regex optimization, you can consider the following regexes / *\n and / * \n. At first glance, they look the same, have the same meaning... But if you run its against (" " x 40000) . "_\n" the first one will check all possibilities while the second one will look for " \n" and fail immediately.

In a vanilla, non-optimized regex engine, both regex can cause catastrophic backtracking, since they need to retry the pattern as it bumps along. However, in the example above, the second doesn't fail with Perl because it have been optimized to find floating substr "0%n"

You can see another example on Jeff Atwood's blog.

Note also that the issue is not about \s consideration but any pattern where xx* is used instead of x+ see example with 0s and also regex explosive quantifiers

With such shorter example the behavior is "findable", but if you start to play with complicated patterns, it's far from easy to spot, for example: Regular expression hangs program (100% CPU usage)

164

answered Oct 08 '22 19:10

Thomas Ayoub

When there is a "plus" node (e.g. \s+) at the beginning of a pattern and the node fails to match, the regex engine skips forward to the point of failure and tries again; with \s*, on the other hand, the engine only advances one character at a time.

Yves Orton explains this optimization nicely here:

The start class optimisation has two modes, "try every valid start position" (doevery) and "flip flop mode" (!doevery) where it trys only the first valid start position in a sequence.

Consider /(\d+)X/ and the string "123456Y", now we know that if we fail to match X after matching "123456" then we will also fail to match after "23456" (assuming no evil tricks are in place, which disable the optimisation anyway), so we know we can skip forward until the check /fails/ and only then start looking for a real match. This is flip-flop mode.

/\s+/ triggers flip-flop mode; /\s*/, /\s\s*/, and /\s\s+/ don't. This optimization can't be applied to "star" nodes like \s* because they can match zero characters, so a failure at one point in a sequence isn't indicative of failure later in the same sequence.

You can see this in the debug output for each regex. I've highlighted the skipped characters with ^. Compare this (skips four characters at a time):

$ perl -Mre=Debug,MATCH -e'"123 456 789 x" =~ /\d+x/'    ...    0 <> <123 456 78>         |  1:PLUS(3)                                   POSIXD[\d] can match 3 times out of 2147483647...                                   failed...    4 <123 > <456 789 x>      |  1:PLUS(3)       ^^^^                                   POSIXD[\d] can match 3 times out of 2147483647...                                   failed...    8 <23 456 > <789 x>       |  1:PLUS(3)          ^^^^                                   POSIXD[\d] can match 3 times out of 2147483647...                                   failed...

to this (skips one or two characters at a time):

$ perl -Mre=Debug,MATCH -e'"123 456 789 x" =~ /\d*x/'    ...    0 <> <123 456 78>         |  1:STAR(3)                                   POSIXD[\d] can match 3 times out of 2147483647...                                   failed...    1 <1> <23 456 789>        |  1:STAR(3)       ^                                   POSIXD[\d] can match 2 times out of 2147483647...                                   failed...    2 <12> <3 456 789 >       |  1:STAR(3)        ^                                   POSIXD[\d] can match 1 times out of 2147483647...                                   failed...    4 <123 > <456 789 x>      |  1:STAR(3)         ^^                                   POSIXD[\d] can match 3 times out of 2147483647...                                   failed...    5 <123 4> <56 789 x>      |  1:STAR(3)           ^                                   POSIXD[\d] can match 2 times out of 2147483647...                                   failed...    6 <23 45> <6 789 x>       |  1:STAR(3)           ^                                   POSIXD[\d] can match 1 times out of 2147483647...                                   failed...    8 <23 456 > <789 x>       |  1:STAR(3)            ^^                                   POSIXD[\d] can match 3 times out of 2147483647...                                   failed...    9 <23 456 7> <89 x>       |  1:STAR(3)              ^                                   POSIXD[\d] can match 2 times out of 2147483647...                                   failed...   10 <23 456 78> <9 x>       |  1:STAR(3)               ^                                   POSIXD[\d] can match 1 times out of 2147483647...                                   failed...   12 <23 456 789 > <x>       |  1:STAR(3)                ^^                                   POSIXD[\d] can match 0 times out of 2147483647...   12 <23 456 789 > <x>       |  3:  EXACT <x>(5)   13 <23 456 789 x> <>       |  5:  END(0)

Note that the optimization isn't applied to /\s\s+/, because \s+ isn't at the beginning of the pattern. Both /\s\s+/ (logically equivalent to /\s{2,}/) and /\s\s*/ (logically equivalent to /\s+/) probably could be optimized, though; it might make sense to ask on perl5-porters whether either would be worth the effort.

_{In case you're interested, "flip-flop mode" is enabled by setting the PREGf_SKIP flag on a regex when it's compiled. See the code around lines 7344 and 7405 in regcomp.c and line 1585 in regexec.c in the 5.24.0 source.}

answered Oct 08 '22 20:10

ThisSuitIsBlackNot

Related questions
                            
                                Find lines not starting with " in Notepad++
                            
                                Extract string from string using RegEx in the Terminal [duplicate]
                            
                                Java: Split string when an uppercase letter is found
                            
                                Why is this regex allowing a caret?
                            
                                grep: group capturing
                            
                                How to put variable in regular expression match?
                            
                                regex match any single character (one character only)
                            
                                Regex with replace in Golang
                            
                                Is there an online RegexBuddy-like regular expression analyzer? [closed]
                            
                                Finding a DOI in a document or page
                            
                                How to do named capture in ruby
                            
                                Extract info inside all parenthesis in R
                            
                                How to loop all the elements that match the regex?
                            
                                Regex to split a CSV
                            
                                Can not extract the capture group with either sed or grep
                            
                                Validate a file name on Windows
                            
                                Visual Studio Code - Removing Lines Containing criteria
                            
                                regular expression for Indian mobile numbers
                            
                                Matching numbers with regular expressions — only digits and commas
                            
                                Are Perl regexes turing complete?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With