I analyzed these two regexes using regex101. I think the backtrack of /\S+:/
is right. But I can't understand that difference. Am I wrong?
Backtracking occurs when a regular expression pattern contains optional quantifiers or alternation constructs, and the regular expression engine returns to a previous saved state to continue its search for a match.
Regular expressions are particularly useful for defining filters. Regular expressions contain a series of characters that define a pattern of text to be matched—to make a filter more specialized, or general.
Using character sets For example, the regular expression "[ A-Za-z] " specifies to match any single uppercase or lowercase letter. In the character set, a hyphen indicates a range of characters, for example [A-Z] will match any one capital letter.
By default, regular expressions will match any part of a string. It's often useful to anchor the regular expression so that it matches from the start or end of the string: ^ matches the start of string. $ matches the end of the string.
This is a pcre optimization called auto-possessification
.
From http://pcre.org/pcre.txt:
PCRE's "auto-possessification" optimization usually applies to character repeats at the end of a pattern (as well as internally). For example, the pattern "
a\d+
" is compiled as if it were "a\d++
" because there is no point even considering the possibility of backtracking into the repeated digits.
and
This is an optimization that, for example, turns
a+b
intoa++b
in order to avoid backtracks intoa+
that can never be successful.
Since :
is not included in \w
, your pattern is interpretted as \w++:
(the second +
prevents backtracking, see possessive quantifiers). The extra backtracking states are avoided because there isn't another state where it could possibly match.
On the other hand, :
is included in \S
, so this optimization does not apply for the second case.
You can see the difference using pcretest
(there's a Windows version you can download here).
The pattern /\w+:/
takes 11 steps and outputs:
/\w+:/
--->get accept:
+0 ^ \w+
+3 ^ ^ :
+0 ^ \w+
+3 ^ ^ :
+0 ^ \w+
+3 ^^ :
+0 ^ \w+
+0 ^ \w+
+3 ^ ^ :
+4 ^ ^ .*
+6 ^ ^
0: accept:
However, if we use the control verb (*NO_AUTO_POSSESS)
, which disables this optimization, the pattern /(*NO_AUTO_POSSESS)\w+:/
takes 14 steps and outputs:
/(*NO_AUTO_POSSESS)\w+:/
--->get accept:
+18 ^ \w+
+21 ^ ^ :
+21 ^ ^ :
+21 ^^ :
+18 ^ \w+
+21 ^ ^ :
+21 ^^ :
+18 ^ \w+
+21 ^^ :
+18 ^ \w+
+18 ^ \w+
+21 ^ ^ :
+22 ^ ^ .*
+24 ^ ^
0: accept:
- It takes 1 step less than \S+
, as expected, because \w+
does not match :
.
Unfortunately regex101 does not support this verb.
Update: regex101 now supports this verb, here's the link to the 3 cases to compare:
/\S+:/
(14 steps) - https://regex101.com/r/cw7hGh/1/debugger
/\w+:/
(10 steps) - https://regex101.com/r/cw7hGh/2/debugger
/(*NO_AUTO_POSSESS)\w+:/
(13 steps) - https://regex101.com/r/cw7hGh/3/debugger
regex101 debugger:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With