Consider this Python code:
import timeit
import re

def one():
    any(s in mystring for s in ('foo', 'bar', 'hello'))

r = re.compile('(foo|bar|hello)')

def two():
    r.search(mystring)

mystring = "hello" * 1000
print([timeit.timeit(k, number=10000) for k in (one, two)])

mystring = "goodbye" * 1000
print([timeit.timeit(k, number=10000) for k in (one, two)])
Basically, I'm benchmarking two ways of checking whether any of several substrings occurs in a large string.
What I get here (Python 3.2.3) is this output:
[0.36678314208984375, 0.03450202941894531]
[0.6672089099884033, 3.7519450187683105]
In the first case, the regular expression easily defeats the any expression: the regex finds the substring immediately, while the any expression has to check the whole string a couple of times before it gets to the correct substring.
But what's going on in the second example? In the case where the substring isn't present, the regular expression is surprisingly slow! This surprises me, since theoretically the regex only has to go over the string once, while the any expression has to go over the string three times. What's wrong here? Is there a problem with my regex, or are Python regexes simply slow in this case?
When the pattern fails to match at one position, the engine restarts the attempt at the next position and keeps scanning until the end of the string. It effectively has quadratic complexity, O(n²), where n is the length of the string. The problem can be resolved by anchoring your pattern, so it fails right away at positions where your pattern has no chance to match.
The reason the regex is so slow is that the "*" quantifier is greedy by default, so the first ".*" tries to match the whole string and then begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.
Being more specific with your regular expressions, even if they become much longer, can make a world of difference in performance. The fewer characters you scan to determine the match, the faster your regexes will be.
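To make that backtracking cost concrete, here is a minimal sketch (my own illustration, not from the original post; the pattern .*foo and the test string are assumptions). An unanchored leading ".*" forces the engine to redo its scan from every starting offset when there is no match, while anchoring with \A limits that work to a single attempt:

import re
import timeit

text = "a" * 5000  # a long string that never contains "foo"

unanchored = re.compile(r'.*foo')    # search() redoes the scan at every offset: roughly O(n^2)
anchored = re.compile(r'\A.*foo')    # \A fails instantly everywhere except offset 0: roughly O(n)

print(timeit.timeit(lambda: unanchored.search(text), number=5))
print(timeit.timeit(lambda: anchored.search(text), number=5))

The exact timings will vary, but as the string grows the unanchored search should slow down much faster than the anchored one.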
I think the correct answer is actually that Python's string handling algorithms are really optimized for this case, and the re module is actually a bit slower. What I've written below is true, but is probably not relevant to the simple regexes in the question.
Apparently this is not a random fluke: Python's re module really is slower. It looks like it uses a recursive backtracking approach when it fails to find a match, as opposed to building a DFA and simulating it.
It uses the backtracking approach even when there are no back references in the regular expression!
What this means is that, in the worst case, Python regexes take exponential, and not linear, time!
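As a quick, hedged illustration of that worst case (my own example, not one taken from the paper), a pattern with nested quantifiers such as (a+)+b shows the exponential blow-up on a string that can never match:

import re
import timeit

# Classic catastrophic-backtracking pattern: nested quantifiers and no
# possible match, so the engine tries every way of splitting the 'a's.
pattern = re.compile(r'(a+)+b')

for n in (20, 22, 24):
    text = "a" * n  # no 'b', so the search must fail
    t = timeit.timeit(lambda: pattern.search(text), number=1)
    print(n, t)  # the time roughly doubles with each extra 'a'

An automaton-based engine handles patterns like this (which contain no backreferences) in linear time.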
This is a very detailed paper describing the issue: http://swtch.com/~rsc/regexp/regexp1.html
I think the graph near the end of that paper summarizes it succinctly.
My coworker found the re2 library (https://code.google.com/p/re2/). There is a Python wrapper for it. It can be a bit of work to get installed on some systems.
I was having the same issue with some complex regexes and long strings; re2 sped the processing up significantly, from seconds to milliseconds.
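For reference, here is a minimal sketch of how such a wrapper is typically used. The module name re2 and its drop-in, re-style API are assumptions about whichever wrapper you install (package names vary, e.g. pyre2 or google-re2), so check your wrapper's documentation:

import re2  # assumed module name exposed by the wrapper

# re2 compiles the pattern to an automaton instead of backtracking,
# so a failed search stays linear in the length of the string.
pattern = re2.compile('(foo|bar|hello)')

mystring = "goodbye" * 1000
print(pattern.search(mystring))  # None, without the pathological slowdown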