Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why do /\w+:/ and /\S+:/ handle backtracking differently?

I analyzed these two regexes using regex101. I think the backtrack of /\S+:/ is right. But I can't understand that difference. Am I wrong?

regex101.com

like image 830
Mr.kang Avatar asked Nov 14 '15 07:11

Mr.kang


People also ask

What is backtracking in regular expressions?

Backtracking occurs when a regular expression pattern contains optional quantifiers or alternation constructs, and the regular expression engine returns to a previous saved state to continue its search for a match.

Why regular expression is required?

Regular expressions are particularly useful for defining filters. Regular expressions contain a series of characters that define a pattern of text to be matched—to make a filter more specialized, or general.

What is the use of given statement in regular expression a za Z?

Using character sets For example, the regular expression "[ A-Za-z] " specifies to match any single uppercase or lowercase letter. In the character set, a hyphen indicates a range of characters, for example [A-Z] will match any one capital letter.

What will the regular expression match?

By default, regular expressions will match any part of a string. It's often useful to anchor the regular expression so that it matches from the start or end of the string: ^ matches the start of string. $ matches the end of the string.


1 Answers

This is a pcre optimization called auto-possessification.

From http://pcre.org/pcre.txt:

PCRE's "auto-possessification" optimization usually applies to character repeats at the end of a pattern (as well as internally). For example, the pattern "a\d+" is compiled as if it were "a\d++" because there is no point even considering the possibility of backtracking into the repeated digits.

and

This is an optimization that, for example, turns a+b into a++b in order to avoid backtracks into a+ that can never be successful.

Since : is not included in \w, your pattern is interpretted as \w++: (the second + prevents backtracking, see possessive quantifiers). The extra backtracking states are avoided because there isn't another state where it could possibly match.

On the other hand, : is included in \S, so this optimization does not apply for the second case.


PCRETEST

You can see the difference using pcretest (there's a Windows version you can download here).

The pattern /\w+:/ takes 11 steps and outputs:

/\w+:/
--->get accept:
 +0 ^               \w+
 +3 ^  ^            :
 +0  ^              \w+
 +3  ^ ^            :
 +0   ^             \w+
 +3   ^^            :
 +0    ^            \w+
 +0     ^           \w+
 +3     ^     ^     :
 +4     ^      ^    .*
 +6     ^      ^    
 0: accept:

However, if we use the control verb (*NO_AUTO_POSSESS), which disables this optimization, the pattern /(*NO_AUTO_POSSESS)\w+:/ takes 14 steps and outputs:

/(*NO_AUTO_POSSESS)\w+:/
--->get accept:
+18 ^               \w+
+21 ^  ^            :
+21 ^ ^             :
+21 ^^              :
+18  ^              \w+
+21  ^ ^            :
+21  ^^             :
+18   ^             \w+
+21   ^^            :
+18    ^            \w+
+18     ^           \w+
+21     ^     ^     :
+22     ^      ^    .*
+24     ^      ^    
 0: accept:

- It takes 1 step less than \S+, as expected, because \w+ does not match :.


Unfortunately regex101 does not support this verb.

Update: regex101 now supports this verb, here's the link to the 3 cases to compare:

  1. /\S+:/ (14 steps) - https://regex101.com/r/cw7hGh/1/debugger

  2. /\w+:/ (10 steps) - https://regex101.com/r/cw7hGh/2/debugger

  3. /(*NO_AUTO_POSSESS)\w+:/ (13 steps) - https://regex101.com/r/cw7hGh/3/debugger

regex101 debugger:

regex101.com debugger

like image 82
Mariano Avatar answered Oct 01 '22 02:10

Mariano