Python re module becomes 20 times slower when looping on more than 100 different regex

My problem is about parsing log files and removing variable parts on each line in order to group them. For instance:

s = re.sub(r'(?i)User [_0-9A-z]+ is ', r"User .. is ", s)
s = re.sub(r'(?i)Message rejected because : (.*?) \(.+\)', r'Message rejected because : \1 (...)', s)

I have about 120+ matching rules like the above.

I found no performance issue while searching successively with up to 100 different regexes, but a huge slowdown occurs as soon as I apply a 101st regex.

The exact same behavior happens when replacing my rules with

for a in range(100):
    s = re.sub(r'(?i)caught here'+str(a)+':.+', r'( ... )', s)

It got 20 times slower when using range(101) instead.

# range(100)
% ./dashlog.py file.bz2
== Took  2.1 seconds.  ==

# range(101)
% ./dashlog.py file.bz2
== Took  47.6 seconds.  ==

Why is this happening? And is there any known workaround?

(Happens on Python 2.6.6/2.7.2 on Linux/Windows.)

Asked Jun 26 '13 by Wiil



1 Answer

Python keeps an internal cache for compiled regular expressions. Whenever you use one of the top-level functions that takes a regular expression, Python first compiles that expression, and the result of that compilation is cached.

Guess how many items the cache can hold?

>>> import re
>>> re._MAXCACHE
100

The moment you exceed the cache size, Python 2 throws away all cached expressions and starts over with an empty cache, so your 101-pattern loop ends up recompiling essentially every pattern on every line. Python 3 raised the limit to 512; early 3.x releases still cleared the whole cache, while newer releases evict the oldest entry instead.
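You can see the cache at work directly. This sketch pokes at `re._MAXCACHE`, a private implementation detail whose value differs between versions (100 on Python 2, 512 on modern Python 3):

```python
import re

# The module-level cache limit for patterns compiled implicitly by
# re.search/re.sub etc. (a private attribute; 512 on modern Python 3).
print(re._MAXCACHE)

# Because compilations are cached, compiling the same pattern twice
# returns the very same object -- no recompilation happens.
assert re.compile("User .* is") is re.compile("User .* is")
```

As long as the number of distinct patterns stays below this limit, the implicit caching makes the top-level functions about as fast as pre-compiled patterns.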

The work-around is for you to cache the compilation yourself:

compiled_expression = re.compile(r'(?i)User [_0-9A-z]+ is ')

compiled_expression.sub(r"User .. is ", s)

You could use functools.partial() to bundle the sub() call together with the replacement expression:

from functools import partial

compiled_expression = re.compile(r'(?i)User [_0-9A-z]+ is ')
ready_to_use_sub = partial(compiled_expression.sub, r"User .. is ")

then later call ready_to_use_sub(s) to apply the compiled pattern together with its replacement.
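Applied to the original problem of 120-odd rules, one way to structure this (a sketch; only the two rules from the question are shown) is to compile every (pattern, replacement) pair once at import time and reuse them for every log line:

```python
import re

# Compile each rule exactly once, up front; the replacement string
# travels with its compiled pattern. The module-level cache size
# no longer matters because we never go through re.sub().
RULES = [
    (re.compile(r'(?i)User [_0-9A-z]+ is '), r'User .. is '),
    (re.compile(r'(?i)Message rejected because : (.*?) \(.+\)'),
     r'Message rejected because : \1 (...)'),
    # ... the remaining ~120 rules ...
]

def normalize(line):
    """Apply every substitution rule to one log line."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line
```

With the compiled objects held in RULES, the cost of compiling is paid once per process instead of (in the worst case) once per line per pattern.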

Answered Sep 18 '22 by Martijn Pieters