
Regular Expressions in Python unexpectedly slow

Consider this Python code:

import timeit
import re

def one():
    any(s in mystring for s in ('foo', 'bar', 'hello'))

r = re.compile('(foo|bar|hello)')

def two():
    r.search(mystring)

mystring = "hello" * 1000
print([timeit.timeit(k, number=10000) for k in (one, two)])
mystring = "goodbye" * 1000
print([timeit.timeit(k, number=10000) for k in (one, two)])

Basically, I'm benchmarking two ways to check for the existence of one of several substrings in a large string.

What I get here (Python 3.2.3) is this output:

[0.36678314208984375, 0.03450202941894531] [0.6672089099884033, 3.7519450187683105] 

In the first case, the regular expression easily defeats the any expression: the regular expression finds the substring immediately, while the any expression has to scan the whole string a couple of times before it gets to the correct substring.

But what's going on in the second example? In the case where the substring isn't present, the regular expression is surprisingly slow! This surprises me, since in theory the regex only has to go over the string once, while the any expression has to go over the string three times. What's wrong here? Is there a problem with my regex, or are Python regexes simply slow in this case?

asked Jun 25 '12 by cha0site




2 Answers

Note to future readers

I think the correct answer is actually that Python's string handling algorithms are really well optimized for this case, and the re module is simply a bit slower. What I've written below is true, but is probably not relevant to the simple regexes in the question.

Original Answer

Apparently this is not a random fluke: Python's re module really is slower. It looks like it uses a recursive backtracking approach when it fails to find a match, as opposed to building a DFA and simulating it.

It uses the backtracking approach even when there are no back references in the regular expression!

What this means is that in the worst case, Python regexes take exponential, not linear, time!
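To see the exponential behavior directly, here is a minimal sketch using the classic pathological pattern a?^n a^n matched against a^n: a backtracking engine has to explore an exponential number of ways to split the string between the optional and mandatory parts before it finds the (all-empty) assignment that succeeds. The function name is my own, not from the question.

```python
import re
import time

def pathological_time(n):
    # 'a?' repeated n times, then 'a' repeated n times, matched against n a's.
    # The match succeeds only when every 'a?' matches empty, which a
    # backtracking engine discovers after trying exponentially many paths.
    pattern = 'a?' * n + 'a' * n
    text = 'a' * n
    start = time.perf_counter()
    assert re.match(pattern, text)
    return time.perf_counter() - start

for n in (5, 10, 15, 20):
    print(n, pathological_time(n))
```

On my understanding, each increment of n roughly doubles the running time, so the n=20 case is dramatically slower than n=10 even though the input is only twice as long.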

This is a very detailed paper describing the issue: http://swtch.com/~rsc/regexp/regexp1.html

I think this graph near the end summarizes it succinctly: [figure: performance of various regular expression implementations, time vs. string length]

answered Sep 19 '22 by cha0site


My coworker found the re2 library (https://code.google.com/p/re2/). There is a Python wrapper. It can be a bit tricky to get installed on some systems.

I was having the same issue with some complex regexes and long strings -- re2 sped the processing time up significantly -- from seconds to milliseconds.
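The Python wrappers for re2 are generally advertised as drop-in replacements for the stdlib re module, so a sketch along these lines should work; the import-with-fallback is my own assumption (the wrapper's package name varies by distribution), not something from the answer:

```python
# Try the re2 wrapper if it is installed; fall back to the stdlib engine,
# since the wrapper mimics the re API (compile/search/match).
try:
    import re2 as re_engine  # hypothetical module name; may be 'pyre2' etc.
except ImportError:
    import re as re_engine

r = re_engine.compile('(foo|bar|hello)')
print(bool(r.search('hello' * 1000)))    # substring present
print(bool(r.search('goodbye' * 1000)))  # substring absent
```

Because re2 simulates an automaton instead of backtracking, the no-match case runs in time linear in the string length, which is exactly the case that was slow in the question.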

answered Sep 21 '22 by Annie B