Consider the regular expression
".*?\s*$"
and a string which does not end with white space.
Example
" a"
. The final \s
can never match the a
which is why the
matcher iterates through:
\s\s\s\s\s - fails
.\s\s\s\s - fails
..\s\s\s - fails
...\s\s - fails
....\s - fails
..... - succeeds
If, however, the regex would match leading white space ("^\s*.*?"
) the match would succeed without any backtracking. (And if the original regex would have been matched backwards (i.e. starting with the last character and working backwards), it would also succeed immediately.)
Is there some way to hint/help the engine in the trailing case? Or is it just the way it is?
(I am interested in the linear backtracking at the end. I choose the white space example to illustrate my point. I would use a trim()
function or similarly if I were to remove leading/trailing white space.)
Your demonstration of engine backtracks is very brief, consisting some steps of different levels of backtracking only. Saying that, I'm going through a more explanatory way:
First level of backtracking (4 times): ()\s*
\s\s\s\s$ - fails (backtrack)
\s\s\s$ - fails (backtrack)
\s\s$ - fails (backtrack)
\s$ - fails (backtrack)
$ - fails
Second level of backtracking (3 times): (.)\s*
.\s\s\s$ - fails (backtrack)
.\s\s$ - fails (backtrack)
.\s$ - fails (backtrack)
.$ - fails
Third level of backtracking (2 times): (..)\s*
Forth level of backtracking (1 time): (...)\s*
Fifth level of backtracking (zero-time backtracks of \s*
): (....)\s*
....$ - fails (backtrack)
.....$ - succeeds
Total number of backtracks: 5 (number of backtracks to .*?
) + 10
Most number of backtracks are caused by number of white-spaces within input string and its corresponding pattern \s*
(which is greedy): 10 backtracks.
You can lessen this number of backtracks using atomic groups (if engine supports it) this way:
.*?(?>\s*)$
It matches all white-spaces greedily and never falls into backtracking. So number of steps will be reduced like so:
Engine moves (5 backtracks):
\s\s\s\s$ - fails (backtrack)
.\s\s\s$ - fails (backtrack)
..\s\s$ - fails (backtrack)
...\s$ - fails (backtrack)
....$ - fails (backtrack)
.....$ - succeeds
Note: I didn't include single backtracks on \s*
at above demonstration.
Instead of writing something like .*?\s*$
where .*?
must check if there's a white-space or not before taking each character, you can use character classes and a group (atomic if possible) to limit the impact of the non-greedy quantifier.
In short, you can change .*?\s*$
to something like (?>\s*\S+)*?\s*$
(obviously (?>\s*\S+)*\s*$
or (?:\s*\S+)*+\s*$
is faster and produces the same result).
When you write it this way, \s*$
is only tested after the last non-whitespace position (at the next white-space or at the end of the string).
If the atomic group feature isn't available, you can emulate it like this:
(?>expression) => (?=(expression))\1
Note: for your particular case, you can also change .*?\s*$
to (?:.*\S)?\s*$
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With