I analyzed these two regexes, <code>/\w+:/</code> and <code>/\S+:/</code>, using regex101. I think the backtracking shown for <code>/\S+:/</code> is right, but I can't understand why <code>/\w+:/</code> backtracks differently. Am I wrong? <img src="https://i.stack.imgur.com/wH0MD.png" alt="regex101.com">


Why do /\w+:/ and /\S+:/ handle backtracking differently?

1 Answer

This is a PCRE optimization called auto-possessification.

From http://pcre.org/pcre.txt:

PCRE's "auto-possessification" optimization usually applies to character repeats at the end of a pattern (as well as internally). For example, the pattern "a\d+" is compiled as if it were "a\d++" because there is no point even considering the possibility of backtracking into the repeated digits.

and

This is an optimization that, for example, turns a+b into a++b in order to avoid backtracks into a+ that can never be successful.

Since : is not included in \w, your pattern is interpreted as \w++: (the second + makes the quantifier possessive, which prevents backtracking; see possessive quantifiers). The extra backtracking states are skipped because giving back characters matched by \w+ can never make the following : match.

On the other hand, : is included in \S, so this optimization does not apply for the second case.
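To see why the rewrite is safe for \w but not for \S, here is a minimal Python sketch. It is an illustration only, not how PCRE implements the optimization. It assumes the usual lookahead-plus-backreference idiom, (?=(X+))\1, as a stand-in for the possessive X++ (lookaheads are atomic in Python's re module), and the helper name first_match is hypothetical.

```python
import re

TEXT = "get accept:"  # subject string from the traces in this answer

def first_match(pattern: str):
    """Return the text of the first match of pattern in TEXT, or None."""
    m = re.search(pattern, TEXT)
    return m.group(0) if m else None

# (?=(X+))\1 behaves like the possessive X++: once the lookahead has
# captured its longest run, the engine cannot give characters back.
greedy_w = first_match(r"\w+:")
possessive_w = first_match(r"(?=(\w+))\1:")
greedy_s = first_match(r"\S+:")
possessive_s = first_match(r"(?=(\S+))\1:")

print(greedy_w, possessive_w)  # same result: backtracking never helps \w+
print(greedy_s, possessive_s)  # differ: \S+ must give back the final ':'
```

With \w the greedy and possessive forms both find accept:, so dropping the backtracking changes nothing. With \S the possessive form finds no match at all, because \S+ swallows the trailing : and cannot return it, which is why the optimization is not applied there.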

PCRETEST

You can see the difference using pcretest (a Windows version is also available for download).

The pattern /\w+:/ takes 11 steps and outputs:

/\w+:/
--->get accept:
 +0 ^               \w+
 +3 ^  ^            :
 +0  ^              \w+
 +3  ^ ^            :
 +0   ^             \w+
 +3   ^^            :
 +0    ^            \w+
 +0     ^           \w+
 +3     ^     ^     :
 +4     ^      ^    .*
 +6     ^      ^    
 0: accept:

However, if we use the control verb (*NO_AUTO_POSSESS), which disables this optimization, the pattern /(*NO_AUTO_POSSESS)\w+:/ takes 14 steps and outputs:

/(*NO_AUTO_POSSESS)\w+:/
--->get accept:
+18 ^               \w+
+21 ^  ^            :
+21 ^ ^             :
+21 ^^              :
+18  ^              \w+
+21  ^ ^            :
+21  ^^             :
+18   ^             \w+
+21   ^^            :
+18    ^            \w+
+18     ^           \w+
+21     ^     ^     :
+22     ^      ^    .*
+24     ^      ^    
 0: accept:

Note: the unoptimized \w+ version takes 1 step fewer than \S+, as expected, because \w+ does not match :.
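The step differences above can be reproduced with a toy model. The sketch below is not PCRE's engine and its totals are not pcretest step counts. It only counts how often a naive matcher for a pattern of the form <class>+: tests the literal : against the subject, with and without backtracking into the repeat. The function name colon_attempts is hypothetical.

```python
import re

def colon_attempts(char_class: str, text: str, possessive: bool):
    """Toy matcher for `<class>+:`; returns (matched, tests of ':')."""
    in_class = re.compile(char_class).fullmatch
    attempts = 0
    for start in range(len(text)):
        # Greedy run: consume as many class characters as possible.
        end = start
        while end < len(text) and in_class(text[end]):
            end += 1
        if end == start:
            continue  # the repeat needs at least one character
        # A possessive repeat only tries its longest run; a plain greedy
        # repeat gives characters back one at a time.
        stops = [end] if possessive else range(end, start, -1)
        for stop in stops:
            attempts += 1
            if text[stop:stop + 1] == ":":
                return True, attempts
    return False, attempts

subject = "get accept:"
print(colon_attempts(r"\w", subject, possessive=True))   # optimized \w+:
print(colon_attempts(r"\w", subject, possessive=False))  # (*NO_AUTO_POSSESS)
print(colon_attempts(r"\S", subject, possessive=False))  # \S+:
```

In this model the possessive \w run tests : 4 times and the backtracking one 7 times, a gap of 3 that lines up with the 11 versus 14 steps in the traces. The \S version needs 8 tests, one more than unoptimized \w, because it first consumes the final : and has to give it back.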

~~Unfortunately regex101 does not support this verb.~~

Update: regex101 now supports this verb. Here are links to the three cases to compare:

1. /\S+:/ (14 steps) - https://regex101.com/r/cw7hGh/1/debugger
2. /\w+:/ (10 steps) - https://regex101.com/r/cw7hGh/2/debugger
3. /(*NO_AUTO_POSSESS)\w+:/ (13 steps) - https://regex101.com/r/cw7hGh/3/debugger

regex101 debugger: [screenshot of the regex101.com debugger]

answered Oct 01 '22 02:10

Mariano


Tags:

regex

pcre

backtracking

Mr.kang
