Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Prevent backtracking on regex to find non-comment lines (not starting with indented '#')

I'd like to search for lines that don't start with a pound sign (#) on indented code.

Currently, I'm using the regex ^\s*([^\s#].*) with multiline option on.

My problem is that on non commented lines it works perfectly.

On commented lines the regex engine performs a backtrack due to \s* all the way from the comment sign to the start of the line, which can sometimes cause 40 or 50 backtrack steps.

The regex works perfectly on python code. It's just not very efficient due to the backtracking caused by the engine.

Any idea as of how to avoid it?


Bonus: It's rather funny that the regex engine doesn't recognize the fact that it's searching for [^\s] one by one in \s* and causes this amount of backtracking. What are the challenges in making the re engine work so?

Bonus 2: Using only the stdlib re module. As I cannot add 3rd parties. (I'm technically searching using sublime text but want to know how to generally do it in Python)

like image 385
Bharel Avatar asked Feb 04 '18 17:02

Bharel


People also ask

What is backtrack in regex?

Backtracking occurs when a regular expression pattern contains optional quantifiers or alternation constructs, and the regular expression engine returns to a previous saved state to continue its search for a match.

How to match backslash in regex python?

In short, to match a literal backslash, one has to write '\\\\' as the RE string, because the regular expression must be "\\", and each backslash must be expressed as "\\" inside a regular Python string literal.

Which regular expression would match a blank line in a plain text file?

The most portable regex would be ^[ \t\n]*$ to match an empty string (note that you would need to replace \t and \n with tab and newline accordingly) and [^ \n\t] to match a non-whitespace string.


2 Answers

Use atomic feature of lookarounds to avoid backtrack:

^(?=(\s*))\1([^#].*)
    ^^^^^  ^

This usage is simplified in a negative lookahead which is proposed by @vks beautifully.

or possessive quantifiers while using regex module:

^\s*+([^#].*)

or even atomic groups:

^(?>\s*)([^#].*)

Sublime Text supports all three since being on PCRE.

and for bonus part, no it's not funny. If you be more eagle-eye on it you'll see it's not [^\s] which is literally equal to \S but it is a little bit different: [^\s#] which for engine means it has two different paths at each step to look for so it backtracks to reach one.

like image 56
revo Avatar answered Sep 19 '22 18:09

revo


You can simply say

^(?!\s*#).*

This takes just 6 steps in comparison to 33 steps taken by yours.

like image 23
vks Avatar answered Sep 19 '22 18:09

vks