After a long debugging session I found out why my application using Python regexps is slow. Here is something I find surprising:
import datetime
import re
pattern = re.compile('(.*)sol(.*)')
lst = ["ciao mandi "*10000 + "sol " + "ciao mandi "*10000,
"ciao mandi "*1000 + "sal " + "ciao mandi "*1000]
for s in lst:
print "string len", len(s)
start = datetime.datetime.now()
re.findall(pattern,s)
print "time spent", datetime.datetime.now() - start
print
The output on my machine is:
string len 220004
time spent 0:00:00.002844
string len 22004
time spent 0:00:05.339580
The first test string is 220K long, matches, and the matching is quite fast. The second test string is 22K long, does not match, and it takes 5 seconds to compute!
I knew about this report, http://swtch.com/~rsc/regexp/regexp1.html, which says that the regexp implementations in Python, Perl, and Ruby are somewhat suboptimal... Is this the reason? I didn't think it could happen with such a simple expression.
added: My original task is to split a string by trying different regexes in turn. Something like:
for regex in ['(.*)sol(.*)', '\emph{([^{}]*)}(.*)', ... ]:
    lst = re.findall(regex, text)
    if lst:
        assert len(lst) == 1
        assert len(lst[0]) == 2
        return lst[0]
This is to explain why I cannot use split. I solved my issue by replacing (.*)sol(.*) with (.*?)sol(.*), as suggested by Martijn.
Probably I should use match instead of findall... but I don't think this would have solved the issue, since the regexp is going to match the entire input, and hence findall should stop at the first match.
Anyway, my question was more about how easy it is to fall into this problem for a regexp newbie... I think it is not so simple to understand that (.*?)sol(.*) is the solution (and that, for example, (.*?)sol(.*?) is not).
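A quick interactive check (my own illustration, with a made-up string) shows why the lazy second group is wrong: .*? matches as little as possible, so it captures the empty string instead of the rest of the input:
>>> import re
>>> re.match('(.*?)sol(.*?)', 'before sol after').groups()
('before ', '')
>>> re.match('(.*?)sol(.*)', 'before sol after').groups()
('before ', ' after')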
The reason the regex is so slow is that the * quantifier is greedy by default, so the first .* tries to match the whole string, and after that begins to backtrack character by character looking for sol. On a non-matching input this grab-and-backtrack work is repeated at every starting position that findall tries, so the runtime is quadratic in the length of the string.
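To see the greedy grab-and-backtrack in action, here is a small interactive illustration (my own example strings, not the poster's):
>>> import re
>>> s = 'x sol y sol z'
>>> re.match('(.*)sol(.*)', s).groups()   # greedy: .* grabs everything, then backtracks to the LAST 'sol'
('x sol y ', ' z')
>>> re.match('(.*?)sol(.*)', s).groups()  # lazy: .*? expands from the start, stops at the FIRST 'sol'
('x ', ' y sol z')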
Plain string operations will generally be faster than the equivalent regular expression operations (unless, of course, you write the string operations in an inefficient way), because a regular expression first has to be parsed and turned into a program that the engine then runs over the input.
Regex engines also distinguish between compiling a pattern once up front and processing it on every use. Compilation takes longer to start, but the compiled pattern is generally faster to reuse; in Python this is what re.compile() is for.
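A rough micro-benchmark of that trade-off (my own sketch; note that Python's re internally caches compiled patterns, so the difference here mostly reflects the cache lookup and argument handling):
import re
import timeit

setup = "import re; s = 'ciao mandi sol ciao mandi'"
# pattern fetched from re's internal cache on every call
print timeit.timeit("re.findall('(.*?)sol(.*)', s)", setup, number=100000)
# pattern compiled once with re.compile() and reused
print timeit.timeit("p.findall(s)", setup + "; p = re.compile('(.*?)sol(.*)')", number=100000)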
The Thompson NFA approach changes regular expressions from default greedy to default non-greedy. Normal regular expression engines can do the same; simply change .* to .*?. You should not use greedy expressions when non-greedy will do.
Someone already built an NFA regular expression parser for Python: https://github.com/xysun/regex
It indeed outperforms the default Python regular expression parser for the pathological cases. However, it under-performs on everything else:
This regex engine underperforms Python's re module on normal inputs (using Glenn Fowler's test suite -- see below)
Fixing the pathological case at the expense of the typical is probably a good reason not to use the NFA approach as a default engine, not when the pathological case can simply be avoided instead.
Another reason is that certain features (such as back references) are either very hard or impossible to implement using the NFA approach. Also see the response on the Python Ideas mailing list.
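Back references are a good example: a pattern like the repeated-word check below (my illustration, not from the thread) requires remembering what a group captured earlier in the match, which a plain NFA cannot do:
>>> import re
>>> re.search(r'\b(\w+) \1\b', 'it was was here').group(0)
'was was'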
As such, your test can be made to perform much better if you make at least one of the .* quantifiers non-greedy to avoid the catastrophic backtracking:
pattern = re.compile('(.*?)sol(.*)')
or don't use a regex at all; you could use str.partition() to get the prefix and postfix instead:
before, sol, after = s.partition('sol')
That is to say, not all text problems are regular-expression shaped, so put down that hammer and look at the rest of your toolbox!
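For example (my own interactive illustration), partition() returns an empty separator when there is no match, which makes the no-match case easy to test:
>>> before, sep, after = 'ciao mandi sol ciao mandi'.partition('sol')
>>> (before, after)
('ciao mandi ', ' ciao mandi')
>>> 'ciao mandi sal ciao mandi'.partition('sol')
('ciao mandi sal ciao mandi', '', '')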
In addition, you could perhaps look at regex, the alternative to the re module. This module implements some basic checks for pathological cases and avoids them deftly, without having to resort to a Thompson NFA implementation. Quoting an entry in a Python bug report tracking regex:
The internal engine no longer interprets a form of bytecode but instead follows a linked set of nodes, and it can work breadth-wise as well as depth-first, which makes it perform much better when faced with one of those 'pathological' regexes.
This engine can run your pathological case more than 100 thousand times faster:
>>> import re, regex
>>> import timeit
>>> p_re = re.compile('(.*)sol(.*)')
>>> p_regex = regex.compile('(.*)sol(.*)')
>>> s = "ciao mandi "*1000 + "sal " + "ciao mandi "*1000
>>> timeit.timeit("p.findall(s)", 'from __main__ import s, p_re as p', number=1)
2.4578459990007104
>>> timeit.timeit("p.findall(s)", 'from __main__ import s, p_regex as p', number=100000)
1.955532732012216
Note the numbers: I limited the re test to 1 run and it took 2.46 seconds, while the regex test ran 100k times in under 2 seconds.
I think this has nothing to do with catastrophic backtracking (or at least my own understanding of it).
The problem is caused by the first (.*) in (.*)sol(.*), together with the fact that the regex is not anchored anywhere. re.findall(), after failing at index 0, would retry at index 1, 2, etc. until the end of the string:
badbadbadbad...bad
^ Attempt to match (.*)sol(.*) from index 0. Fail
 ^ Attempt to match (.*)sol(.*) from index 1. Fail
  ^ Attempt to match (.*)sol(.*) from index 2. Fail (and so on)
It effectively has quadratic complexity O(n²), where n is the length of the string.
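You can observe the quadratic growth directly (a rough sketch of mine; absolute timings will vary by machine) by doubling the input and watching the time roughly quadruple:
import re
import timeit

pattern = re.compile('(.*)sol(.*)')
for n in (125, 250, 500, 1000):
    s = "ciao mandi " * n + "sal " + "ciao mandi " * n  # non-matching input
    # with the unanchored (.*)sol(.*), time roughly quadruples as n doubles: O(n^2)
    print n, timeit.timeit(lambda: pattern.findall(s), number=1)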
The problem can be resolved by anchoring your pattern, so that it fails right away at positions where it has no chance to match. (.*)sol(.*) searches for sol within a line of text (. does not match line terminators by default), so if there is no match at the start of a line, there won't be one anywhere else in that line either.
Therefore, you can use:
^(.*)sol(.*)
with the re.MULTILINE option.
Running this test code (modified from yours):
import datetime
import re
pattern = re.compile('^(.*)sol(.*)', re.MULTILINE)
lst = ["ciao mandi "*10000 + "sol " + "ciao mandi "*10000,
"ciao mandi "*10000 + "sal " + "ciao mandi "*10000]
for s in lst:
print "string len", len(s)
start = datetime.datetime.now()
re.findall(pattern,s)
print "time spent", datetime.datetime.now() - start
print
(Note that both the matching and the non-matching strings are 220004 characters long.)
Gives the following result:
string len 220004
time spent 0:00:00.002000
string len 220004
time spent 0:00:00.005000
This demonstrates clearly that both cases now have the same order of magnitude.