Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular expression that never finishes running

Tags:

python

regex

I wrote a small, naive regular expression that was supposed to find text inside parentheses:

re.search(r'\((.|\s)*\)', name)

I know this is not the best way to do it for a few reasons, but it was working just fine. What I am looking for is simply an explanation as to why for some strings this expression starts taking exponentially longer and then never finishes. Last night, after running this code for months, one of our servers suddenly got stuck matching a string similar to the following:

x (y)                                            z

I've experimented with it and determined that the time taken doubles for every space in between the 'y' and 'z':

In [62]: %timeit re.search(r'\((.|\s)*\)', 'x (y)' + (22 * ' ') + 'z')
1 loops, best of 3: 1.23 s per loop

In [63]: %timeit re.search(r'\((.|\s)*\)', 'x (y)' + (23 * ' ') + 'z')
1 loops, best of 3: 2.46 s per loop

In [64]: %timeit re.search(r'\((.|\s)*\)', 'x (y)' + (24 * ' ') + 'z')
1 loops, best of 3: 4.91 s per loop

But also that characters other than a space do not have the same effect:

In [65]: %timeit re.search(r'\((.|\s)*\)', 'x (y)' + (24 * 'a') + 'z')
100000 loops, best of 3: 5.23 us per loop

Note: I am not looking for a better regular expression or another solution to this problem. We are no longer using it.

like image 820
fletom Avatar asked Oct 05 '12 15:10

fletom


People also ask

What is ?! In regex?

It's a negative lookahead, which means that for the expression to match, the part within (?!...) must not match. In this case the regex matches http:// only when it is not followed by the current host name (roughly, see Thilo's comment).

What does * do in regex?

The Match-zero-or-more Operator ( * ) This operator repeats the smallest possible preceding regular expression as many times as necessary (including zero) to match the pattern. `*' represents this operator. For example, `o*' matches any string made up of zero or more `o' s.

How do you repeat a regular expression?

An expression followed by '*' can be repeated any number of times, including zero. An expression followed by '+' can be repeated any number of times, but at least once. An expression followed by '? ' may be repeated zero or one times only.


2 Answers

Catastrophic Backtracking

As CaffGeek's answer correctly implies, the problem is due to one form of catastrophic backtracking. The two alternatives both match a space (or tab) and this is applied unlimited times greedily. Additionally the dot matches the closing parentheses so once the opening paren is matched this expression always matches to the very end of the string before it must painstakingly backtrack to find the closing bracket. And during this backtracking process, the other alternative is tried at each location for (which is also successful for spaces or tabs). Thus, every possible matching combination sequence must be tried before the engine can backtrack one position. With a lot of spaces after the closing paren, this adds up quickly. The specific problem for the case where there is a matching close paren can be solved by simply making the star quantifier lazy (i.e. r'\((.|\s)*?\)'), but the runaway regex problem still exists for the non-matching case where there is an opening paren with no matching close paren in the subject string.

The original regex is really, really bad! (and also does not correctly match up closing parens when there are more than one pair).

The correct expression to match innermost parentheses (which is very fast for both matching and non-matching cases), is of course:

re_innermostparens = re.compile(r"""
    \(        # Literal open paren.
    [^()]*    # Zero or more non-parens.
    \)        # Literal close paren.
    """, re.VERBOSE)

All regex authors should read MRE3!

This is all explained in great detail, (with thorough examples and recommended best practices) in Jeffrey Friedl's must-read-for-regex-authors: Mastering Regular Expressions (3rd Edition). I can honestly say that this is the most useful book I've ever read. Regular expressions are a very powerful tool but like a loaded weapon must be applied with great care and precision (or you will shoot yourself in the foot!)

like image 121
ridgerunner Avatar answered Sep 19 '22 13:09

ridgerunner


Probably a ReDos issue.

See: http://en.wikipedia.org/wiki/ReDoS

It's possible to create a regular expression denial of service on poorly constructed regexes.

For example, from here: https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS

This regex ^(a+)+$

enter image description here

For the input aaaaX there are 16 possible paths in the above graph. But for aaaaaaaaaaaaaaaaX there are 65536 possible paths, and the number is double for each additional a. This is an extreme case where the naïve algorithm is problematic, because it must pass on many many paths, and then fail.

I suspect a large part of the problem with your regex is this (.|\s) which is somewhat confusing, as \s is already included in .. But creates an option point.

EDIT: I think this was one of the better explanations of the issue I've read.

http://msdn.microsoft.com/en-us/magazine/ff646973.aspx

like image 34
CaffGeek Avatar answered Sep 18 '22 13:09

CaffGeek