My problem involves parsing log files and removing the variable parts from each line in order to group the lines. For instance:
s = re.sub(r'(?i)User [_0-9A-Za-z]+ is ', r"User .. is ", s)
s = re.sub(r'(?i)Message rejected because : (.*?) \(.+\)', r'Message rejected because : \1 (...)', s)
I have 120+ matching rules like these. Applying 100 different regexes in succession shows no performance problem, but a huge slowdown occurs as soon as there are 101.
The exact same behavior occurs when I replace my rules with:
for a in range(100):
    s = re.sub(r'(?i)caught here' + str(a) + ':.+', r'( ... )', s)
It got 20 times slower when using range(101) instead.
# range(100)
% ./dashlog.py file.bz2
== Took 2.1 seconds. ==
# range(101)
% ./dashlog.py file.bz2
== Took 47.6 seconds. ==
Why is this happening? And is there any known workaround?
(Happens on Python 2.6.6/2.7.2 on Linux/Windows.)
Conclusion: grep is so much faster than Python's regex engine that even reading the whole file several times does not matter.
The Python "re" module provides regular expression support.
Python keeps an internal cache for compiled regular expressions. Whenever you use one of the top-level functions that takes a regular expression, Python first compiles that expression, and the result of that compilation is cached.
Guess how many items the cache can hold?
>>> import re
>>> re._MAXCACHE
100
The moment you exceed the cache size, Python 2 clears all cached expressions and starts with a clean cache. Python 3 raised the limit to 512; up to Python 3.6 it also cleared the whole cache on overflow, while later versions evict only the oldest entry.
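For illustration, here is a minimal sketch that pokes at the cache; note that re._cache and re._MAXCACHE are private CPython implementation details, not public API:
import re

print(re._MAXCACHE)      # 100 on Python 2, 512 on Python 3
print(len(re._cache))    # number of compiled patterns cached so far

# Compile one more distinct pattern than the cache can hold.
for i in range(re._MAXCACHE + 1):
    re.search('pattern%d' % i, 'some text')

# On Python 2 (and Python 3 up to 3.6) the cache was wiped and refilled,
# so few entries remain; on 3.7+ only the oldest entry was evicted, so
# the cache stays full.
print(len(re._cache))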
The workaround is to cache the compilation yourself:
compiled_expression = re.compile(r'(?i)User [_0-9A-Za-z]+ is ')
s = compiled_expression.sub(r"User .. is ", s)
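Applied to the rule set from the question, the workaround might look like the sketch below (RULES and normalize are illustrative names; only the first two rules are shown):
import re

# Compile every rule exactly once at start-up; compiled pattern objects
# bypass the re module's internal cache entirely.
RULES = [
    (re.compile(r'(?i)User [_0-9A-Za-z]+ is '), r'User .. is '),
    (re.compile(r'(?i)Message rejected because : (.*?) \(.+\)'),
     r'Message rejected because : \1 (...)'),
    # ... the remaining ~120 rules
]

def normalize(line):
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line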
You could use functools.partial() to bundle the sub() call together with the replacement expression:
from functools import partial
compiled_expression = re.compile(r'(?i)User [_0-9A-Za-z]+ is ')
ready_to_use_sub = partial(compiled_expression.sub, r"User .. is ")
Then later on, call ready_to_use_sub(s) to apply the compiled regular expression pattern together with its replacement.
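The same idea extends to the whole rule table; here is a sketch (ready_subs and clean are illustrative names) where each rule becomes one ready-to-call substitution:
from functools import partial
import re

# Each partial bundles a precompiled pattern's sub() method with its
# replacement string, so applying a rule is a single call.
ready_subs = [
    partial(re.compile(r'(?i)User [_0-9A-Za-z]+ is ').sub, r'User .. is '),
    partial(re.compile(r'(?i)Message rejected because : (.*?) \(.+\)').sub,
            r'Message rejected because : \1 (...)'),
    # ... the remaining rules
]

def clean(line):
    for sub in ready_subs:
        line = sub(line)
    return line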