My problem involves parsing log files and removing the variable parts from each line in order to group the lines. For instance:
s = re.sub(r'(?i)User [_0-9A-Za-z]+ is ', r"User .. is ", s)
s = re.sub(r'(?i)Message rejected because : (.*?) \(.+\)', r'Message rejected because : \1 (...)', s)
I have 120+ matching rules like these. Applying 100 different regexes in succession shows no performance problem, but a huge slowdown occurs as soon as there are 101.
The exact same behavior occurs when I replace my rules with:
for a in range(100):
    s = re.sub(r'(?i)caught here' + str(a) + ':.+', r'( ... )', s)
It got 20 times slower when using range(101) instead.
# range(100)
% ./dashlog.py file.bz2
== Took 2.1 seconds. ==
# range(101)
% ./dashlog.py file.bz2
== Took 47.6 seconds. ==
Why is this happening? And is there any known workaround?
(Happens on Python 2.6.6/2.7.2 on Linux/Windows.)
Conclusion: grep is so much faster than Python's regex engine that even reading the whole file several times does not matter.
The Python "re" module provides regular expression support.
Python keeps an internal cache for compiled regular expressions. Whenever you use one of the top-level functions that takes a regular expression, Python first compiles that expression, and the result of that compilation is cached.
Guess how many items the cache can hold?
>>> import re
>>> re._MAXCACHE
100
The moment you exceed the cache size, Python 2 clears all cached expressions and starts with a clean cache. Python 3 raised the limit to 512; up to Python 3.6 it also cleared the whole cache on overflow, while later versions evict only the oldest entry.
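For illustration, here is a minimal sketch that pokes at the cache; note that re._cache and re._MAXCACHE are private CPython implementation details, not public API:
import re

print(re._MAXCACHE)      # 100 on Python 2, 512 on Python 3
print(len(re._cache))    # number of compiled patterns cached so far

# Compile one more distinct pattern than the cache can hold.
for i in range(re._MAXCACHE + 1):
    re.search('pattern%d' % i, 'some text')

# On Python 2 (and Python 3 up to 3.6) the cache was wiped and refilled,
# so few entries remain; on 3.7+ only the oldest entry was evicted, so
# the cache stays full.
print(len(re._cache))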
The workaround is to cache the compilation yourself:
compiled_expression = re.compile(r'(?i)User [_0-9A-Za-z]+ is ')
s = compiled_expression.sub(r"User .. is ", s)
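Applied to the rule set from the question, the workaround might look like the sketch below (RULES and normalize are illustrative names; only the first two rules are shown):
import re

# Compile every rule exactly once at start-up; compiled pattern objects
# bypass the re module's internal cache entirely.
RULES = [
    (re.compile(r'(?i)User [_0-9A-Za-z]+ is '), r'User .. is '),
    (re.compile(r'(?i)Message rejected because : (.*?) \(.+\)'),
     r'Message rejected because : \1 (...)'),
    # ... the remaining ~120 rules
]

def normalize(line):
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line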
You could use functools.partial() to bundle the sub() call together with the replacement expression:
from functools import partial
compiled_expression = re.compile(r'(?i)User [_0-9A-Za-z]+ is ')
ready_to_use_sub = partial(compiled_expression.sub, r"User .. is ")
Then later on, call ready_to_use_sub(s) to apply the compiled regular expression pattern together with its replacement.
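The same idea extends to the whole rule table; here is a sketch (ready_subs and clean are illustrative names) where each rule becomes one ready-to-call substitution:
from functools import partial
import re

# Each partial bundles a precompiled pattern's sub() method with its
# replacement string, so applying a rule is a single call.
ready_subs = [
    partial(re.compile(r'(?i)User [_0-9A-Za-z]+ is ').sub, r'User .. is '),
    partial(re.compile(r'(?i)Message rejected because : (.*?) \(.+\)').sub,
            r'Message rejected because : \1 (...)'),
    # ... the remaining rules
]

def clean(line):
    for sub in ready_subs:
        line = sub(line)
    return line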