Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does python's re.search method hang?

Tags:

python

regex

I'm using python regex library to parse some strings and currently I found that my regex is either too complicated or the string I'm searching is too long.

Here's an example of the hang up:

>>> import re
>>> reg = "(\w+'?\s*)+[-|~]\s*((\d+\.?\d+\$?)|(\$?\d+\.?\d+))"
>>> re.search(reg, "**LOOKING FOR PAYPAL OFFERS ON THESE PAINTED UNCOMMONS**") #Hangs here...

I'm not sure what's going on. Any help appreciated!

EDIT: Here's a link with examples of what I'm trying to match: Regxr

like image 623
Gawndy Avatar asked Apr 09 '17 18:04

Gawndy


People also ask

How RegEx works in Python?

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. RegEx can be used to check if a string contains the specified search pattern.

What does re search mean in Python?

The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object or None otherwise.

Can we use RegEx in Python?

Python has a module named re to work with regular expressions. To use it, we need to import the module. The module defines several functions and constants to work with RegEx.

Which flag will make your search case insensitive Python?

IGNORECASE : This flag allows for case-insensitive matching of the Regular Expression with the given string i.e. expressions like [A-Z] will match lowercase letters, too.


1 Answers

The reason why the code execution hangs is catastrophic backtracking due to one obligatory and 1+ optional patterns (those that can match an empty string) inside a quantified group (\w+'?\s*)+ that allows a regex engine to test a lot of matching paths, so many that it takes too long to complete.

I suggest unwrapping the problematic group in such a way that ' or \s become obligatory and wrap them in an optional group:

(\w+(?:['\s]+\w+)*)\s*[-~]\s*(\$?\d+(?:\.\d+)?\$?)
^^^^^^^^^^^^^^^^^^^***

See the regex demo

Here, (\w+(?:['\s]+\w+)*) will match 1+ word chars, and then 0+ sequences of 1+ ' or whitespaces followed with 1+ word chars. This way, the pattern becomes linear and the regex engine fails the match quicker if a non-matching string occurs.

The rest of the pattern:

  • \s*[-~]\s* - either - or ~ wrapped with 0+ whitespaces
  • (\$?\d+(?:\.\d+)?\$?) - Group 2 capturing
    • \$? - 1 or 0 $ symbols
    • \d+ - 1+ digits
    • (?:\.\d+)? - 1 or 0 zero sequences of:
      • \. - a dot
      • \d+ - 1+ digits
    • \$? - 1 or 0 $ symbols
like image 193
Wiktor Stribiżew Avatar answered Oct 17 '22 04:10

Wiktor Stribiżew