Consider this Python code:
import timeit
import re

def one():
    any(s in mystring for s in ('foo', 'bar', 'hello'))

r = re.compile('(foo|bar|hello)')

def two():
    r.search(mystring)

mystring = "hello" * 1000
print([timeit.timeit(k, number=10000) for k in (one, two)])

mystring = "goodbye" * 1000
print([timeit.timeit(k, number=10000) for k in (one, two)])
Basically, I'm benchmarking two ways of checking whether any of several substrings occurs in a large string.
What I get here (Python 3.2.3) is this output:
[0.36678314208984375, 0.03450202941894531]
[0.6672089099884033, 3.7519450187683105]
In the first case, the regular expression easily defeats the any expression: the regex finds the substring immediately, while the any expression has to check the whole string a couple of times before it gets to the correct substring.
But what's going on in the second example? In the case where the substring isn't present, the regular expression is surprisingly slow! This surprises me, since theoretically the regex only has to go over the string once, while the any expression has to go over the string three times. What's wrong here? Is there a problem with my regex, or are Python regexes simply slow in this case?
When the pattern fails to match at one position, the engine restarts the attempt at the next position and keeps scanning until the end of the string. It effectively has quadratic complexity, O(n²), where n is the length of the string. The problem can be resolved by anchoring your pattern, so it fails right away at positions where your pattern has no chance to match.
The reason the regex is so slow is that the "*" quantifier is greedy by default, so the first ".*" tries to match the whole string and then begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.
Being more specific with your regular expressions, even if they become much longer, can make a world of difference in performance. The fewer characters you scan to determine the match, the faster your regexes will be.
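To make that backtracking cost concrete, here is a minimal sketch (my own illustration, not from the original post; the pattern .*foo and the test string are assumptions). An unanchored leading ".*" forces the engine to redo its scan from every starting offset when there is no match, while anchoring with \A limits that work to a single attempt:

import re
import timeit

text = "a" * 5000  # a long string that never contains "foo"

unanchored = re.compile(r'.*foo')    # search() redoes the scan at every offset: roughly O(n^2)
anchored = re.compile(r'\A.*foo')    # \A fails instantly everywhere except offset 0: roughly O(n)

print(timeit.timeit(lambda: unanchored.search(text), number=5))
print(timeit.timeit(lambda: anchored.search(text), number=5))

The exact timings will vary, but as the string grows the unanchored search should slow down much faster than the anchored one.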
I think the correct answer is actually that Python's string handling algorithms are really optimized for this case, and the re module is actually a bit slower. What I've written below is true, but is probably not relevant to the simple regexes in the question.
Apparently this is not a random fluke: Python's re module really is slower. It looks like it uses a recursive backtracking approach when it fails to find a match, as opposed to building a DFA and simulating it.
It uses the backtracking approach even when there are no back references in the regular expression!
What this means is that, in the worst case, Python regexes take exponential, and not linear, time!
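As a quick, hedged illustration of that worst case (my own example, not one taken from the paper), a pattern with nested quantifiers such as (a+)+b shows the exponential blow-up on a string that can never match:

import re
import timeit

# Classic catastrophic-backtracking pattern: nested quantifiers and no
# possible match, so the engine tries every way of splitting the 'a's.
pattern = re.compile(r'(a+)+b')

for n in (20, 22, 24):
    text = "a" * n  # no 'b', so the search must fail
    t = timeit.timeit(lambda: pattern.search(text), number=1)
    print(n, t)  # the time roughly doubles with each extra 'a'

An automaton-based engine handles patterns like this (which contain no backreferences) in linear time.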
This is a very detailed paper describing the issue: http://swtch.com/~rsc/regexp/regexp1.html
I think the graph near the end of that paper summarizes it succinctly.
My coworker found the re2 library (https://code.google.com/p/re2/). There is a Python wrapper for it. It can be a bit of work to get installed on some systems.
I was having the same issue with some complex regexes and long strings; re2 sped the processing up significantly, from seconds to milliseconds.
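For reference, here is a minimal sketch of how such a wrapper is typically used. The module name re2 and its drop-in, re-style API are assumptions about whichever wrapper you install (package names vary, e.g. pyre2 or google-re2), so check your wrapper's documentation:

import re2  # assumed module name exposed by the wrapper

# re2 compiles the pattern to an automaton instead of backtracking,
# so a failed search stays linear in the length of the string.
pattern = re2.compile('(foo|bar|hello)')

mystring = "goodbye" * 1000
print(pattern.search(mystring))  # None, without the pathological slowdown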