After a long debugging session I found out why my application using Python regexps is slow. Here is something I find surprising:
import datetime
import re
pattern = re.compile('(.*)sol(.*)')
lst = ["ciao mandi "*10000 + "sol " + "ciao mandi "*10000,
"ciao mandi "*1000 + "sal " + "ciao mandi "*1000]
for s in lst:
print "string len", len(s)
start = datetime.datetime.now()
re.findall(pattern,s)
print "time spent", datetime.datetime.now() - start
print
The output on my machine is:
string len 220004
time spent 0:00:00.002844
string len 22004
time spent 0:00:05.339580
The first test string is 220K long, matches, and the matching is quite fast. The second test string is 22K long, does not match, and it takes 5 seconds to compute!
I knew about this report, http://swtch.com/~rsc/regexp/regexp1.html, which says that the regexp implementations in Python, Perl, and Ruby are somewhat suboptimal... Is this the reason? I didn't think it could happen with such a simple expression.
added: My original task is to split a string by trying different regexes in turn. Something like:
for regex in ['(.*)sol(.*)', '\emph{([^{}]*)}(.*)', ... ]:
    lst = re.findall(regex, text)
    if lst:
        assert len(lst) == 1
        assert len(lst[0]) == 2
        return lst[0]
This is to explain why I cannot use split. I solved my issue by replacing (.*)sol(.*) with (.*?)sol(.*), as suggested by Martijn.
Probably I should use match instead of findall... but I don't think this would have solved the issue, since the regexp is going to match the entire input, and hence findall should stop at the first match.
Anyway, my question was more about how easy it is to fall into this problem for a regexp newbie... I think it is not so simple to understand that (.*?)sol(.*) is the solution (and that, for example, (.*?)sol(.*?) is not).
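A quick interactive check (my own illustration, with a made-up string) shows why the lazy second group is wrong: .*? matches as little as possible, so it captures the empty string instead of the rest of the input:
>>> import re
>>> re.match('(.*?)sol(.*?)', 'before sol after').groups()
('before ', '')
>>> re.match('(.*?)sol(.*)', 'before sol after').groups()
('before ', ' after')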
The reason the regex is so slow is that the * quantifier is greedy by default, so the first .* tries to match the whole string, and after that begins to backtrack character by character looking for sol. On a non-matching input this grab-and-backtrack work is repeated at every starting position that findall tries, so the runtime is quadratic in the length of the string.
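To see the greedy grab-and-backtrack in action, here is a small interactive illustration (my own example strings, not the poster's):
>>> import re
>>> s = 'x sol y sol z'
>>> re.match('(.*)sol(.*)', s).groups()   # greedy: .* grabs everything, then backtracks to the LAST 'sol'
('x sol y ', ' z')
>>> re.match('(.*?)sol(.*)', s).groups()  # lazy: .*? expands from the start, stops at the FIRST 'sol'
('x ', ' y sol z')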
Plain string operations will generally be faster than the equivalent regular expression operations (unless, of course, you write the string operations in an inefficient way), because a regular expression first has to be parsed and turned into a program that the engine then runs over the input.
Regex engines also distinguish between compiling a pattern once up front and processing it on every use. Compilation takes longer to start, but the compiled pattern is generally faster to reuse; in Python this is what re.compile() is for.
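A rough micro-benchmark of that trade-off (my own sketch; note that Python's re internally caches compiled patterns, so the difference here mostly reflects the cache lookup and argument handling):
import re
import timeit

setup = "import re; s = 'ciao mandi sol ciao mandi'"
# pattern fetched from re's internal cache on every call
print timeit.timeit("re.findall('(.*?)sol(.*)', s)", setup, number=100000)
# pattern compiled once with re.compile() and reused
print timeit.timeit("p.findall(s)", setup + "; p = re.compile('(.*?)sol(.*)')", number=100000)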
The Thompson NFA approach changes regular expressions from default greedy to default non-greedy. Normal regular expression engines can do the same; simply change .* to .*?. You should not use greedy expressions when non-greedy will do.
Someone already built an NFA regular expression parser for Python: https://github.com/xysun/regex
It indeed outperforms the default Python regular expression parser for the pathological cases. However, it under-performs on everything else:
This regex engine underperforms Python's re module on normal inputs (using Glenn Fowler's test suite -- see below)
Fixing the pathological case at the expense of the typical is probably a good reason not to use the NFA approach as a default engine, not when the pathological case can simply be avoided instead.
Another reason is that certain features (such as back references) are either very hard or impossible to implement using the NFA approach. Also see the response on the Python Ideas mailing list.
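Back references are a good example: a pattern like the repeated-word check below (my illustration, not from the thread) requires remembering what a group captured earlier in the match, which a plain NFA cannot do:
>>> import re
>>> re.search(r'\b(\w+) \1\b', 'it was was here').group(0)
'was was'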
As such, your test can be made to perform much better if you make at least one of the .* quantifiers non-greedy to avoid the catastrophic backtracking:
pattern = re.compile('(.*?)sol(.*)')
or don't use a regex at all; you could use str.partition() to get the prefix and postfix instead:
before, sol, after = s.partition('sol')
That is to say, not all text problems are regular-expression shaped, so put down that hammer and look at the rest of your toolbox!
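For example (my own interactive illustration), partition() returns an empty separator when there is no match, which makes the no-match case easy to test:
>>> before, sep, after = 'ciao mandi sol ciao mandi'.partition('sol')
>>> (before, after)
('ciao mandi ', ' ciao mandi')
>>> 'ciao mandi sal ciao mandi'.partition('sol')
('ciao mandi sal ciao mandi', '', '')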
In addition, you could perhaps look at regex, the alternative to the re module. This module implements some basic checks for pathological cases and avoids them deftly, without having to resort to a Thompson NFA implementation. Quoting an entry in a Python bug report tracking regex:
The internal engine no longer interprets a form of bytecode but instead follows a linked set of nodes, and it can work breadth-wise as well as depth-first, which makes it perform much better when faced with one of those 'pathological' regexes.
This engine can run your pathological case more than 100 thousand times faster:
>>> import re, regex
>>> import timeit
>>> p_re = re.compile('(.*)sol(.*)')
>>> p_regex = regex.compile('(.*)sol(.*)')
>>> s = "ciao mandi "*1000 + "sal " + "ciao mandi "*1000
>>> timeit.timeit("p.findall(s)", 'from __main__ import s, p_re as p', number=1)
2.4578459990007104
>>> timeit.timeit("p.findall(s)", 'from __main__ import s, p_regex as p', number=100000)
1.955532732012216
Note the numbers: I limited the re test to 1 run and it took 2.46 seconds, while the regex test ran 100k times in under 2 seconds.
I think this has nothing to do with catastrophic backtracking (or at least my own understanding of it).
The problem is caused by the first (.*) in (.*)sol(.*), together with the fact that the regex is not anchored anywhere. re.findall(), after failing at index 0, would retry at index 1, 2, etc. until the end of the string:
badbadbadbad...bad
^ Attempt to match (.*)sol(.*) from index 0. Fail
 ^ Attempt to match (.*)sol(.*) from index 1. Fail
  ^ Attempt to match (.*)sol(.*) from index 2. Fail (and so on)
It effectively has quadratic complexity O(n²), where n is the length of the string.
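You can observe the quadratic growth directly (a rough sketch of mine; absolute timings will vary by machine) by doubling the input and watching the time roughly quadruple:
import re
import timeit

pattern = re.compile('(.*)sol(.*)')
for n in (125, 250, 500, 1000):
    s = "ciao mandi " * n + "sal " + "ciao mandi " * n  # non-matching input
    # with the unanchored (.*)sol(.*), time roughly quadruples as n doubles: O(n^2)
    print n, timeit.timeit(lambda: pattern.findall(s), number=1)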
The problem can be resolved by anchoring your pattern, so that it fails right away at positions where it has no chance to match. (.*)sol(.*) searches for sol within a line of text (. does not match line terminators by default), so if there is no match at the start of a line, there won't be one anywhere else in that line either.
Therefore, you can use:
^(.*)sol(.*)
with the re.MULTILINE option.
Running this test code (modified from yours):
import datetime
import re
pattern = re.compile('^(.*)sol(.*)', re.MULTILINE)
lst = ["ciao mandi "*10000 + "sol " + "ciao mandi "*10000,
"ciao mandi "*10000 + "sal " + "ciao mandi "*10000]
for s in lst:
print "string len", len(s)
start = datetime.datetime.now()
re.findall(pattern,s)
print "time spent", datetime.datetime.now() - start
print
(Note that both the matching and the non-matching strings are 220004 characters long.)
Gives the following result:
string len 220004
time spent 0:00:00.002000
string len 220004
time spent 0:00:00.005000
This demonstrates clearly that both cases now have the same order of magnitude.