Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to avoid (linear) back tracking in a non-greedy regular expression?

Tags:

regex

Consider the regular expression

".*?\s*$"

and a string which does not end with white space.

Example

" a". The final \s can never match the a which is why the matcher iterates through:

\s\s\s\s\s        - fails
.\s\s\s\s         - fails
..\s\s\s          - fails
...\s\s           - fails
....\s            - fails
.....             - succeeds

If, however, the regex would match leading white space ("^\s*.*?") the match would succeed without any backtracking. (And if the original regex would have been matched backwards (i.e. starting with the last character and working backwards), it would also succeed immediately.)

Is there some way to hint/help the engine in the trailing case? Or is it just the way it is?

(I am interested in the linear backtracking at the end. I choose the white space example to illustrate my point. I would use a trim() function or similarly if I were to remove leading/trailing white space.)

like image 249
Micha Wiedenmann Avatar asked Mar 12 '23 08:03

Micha Wiedenmann


2 Answers

How it backtracks

Your demonstration of engine backtracks is very brief, consisting some steps of different levels of backtracking only. Saying that, I'm going through a more explanatory way:

First level of backtracking (4 times): ()\s*

\s\s\s\s$         - fails (backtrack)
\s\s\s$          - fails (backtrack)
\s\s$           - fails (backtrack)
\s$            - fails (backtrack)
$             - fails

Second level of backtracking (3 times): (.)\s*

.\s\s\s$        - fails (backtrack)
.\s\s$         - fails (backtrack)
.\s$          - fails (backtrack)
.$           - fails

Third level of backtracking (2 times): (..)\s*

Forth level of backtracking (1 time): (...)\s*

Fifth level of backtracking (zero-time backtracks of \s*): (....)\s*

....$        - fails (backtrack)
.....$        - succeeds

Total number of backtracks: 5 (number of backtracks to .*?) + 10

Make it less

Most number of backtracks are caused by number of white-spaces within input string and its corresponding pattern \s* (which is greedy): 10 backtracks.

You can lessen this number of backtracks using atomic groups (if engine supports it) this way:

.*?(?>\s*)$

It matches all white-spaces greedily and never falls into backtracking. So number of steps will be reduced like so:

Engine moves (5 backtracks):

\s\s\s\s$        - fails (backtrack)
.\s\s\s$        - fails (backtrack)
..\s\s$        - fails (backtrack)
...\s$        - fails (backtrack)
....$        - fails (backtrack)
.....$        - succeeds

Note: I didn't include single backtracks on \s* at above demonstration.

enter image description here

like image 66
revo Avatar answered Mar 22 '23 23:03

revo


Instead of writing something like .*?\s*$ where .*? must check if there's a white-space or not before taking each character, you can use character classes and a group (atomic if possible) to limit the impact of the non-greedy quantifier.

In short, you can change .*?\s*$ to something like (?>\s*\S+)*?\s*$ (obviously (?>\s*\S+)*\s*$ or (?:\s*\S+)*+\s*$ is faster and produces the same result). When you write it this way, \s*$ is only tested after the last non-whitespace position (at the next white-space or at the end of the string).

If the atomic group feature isn't available, you can emulate it like this:

(?>expression)   =>    (?=(expression))\1 

Note: for your particular case, you can also change .*?\s*$ to (?:.*\S)?\s*$

like image 32
Casimir et Hippolyte Avatar answered Mar 23 '23 01:03

Casimir et Hippolyte