I read a cool article on how to avoid creating slow regular expressions. Generally speaking it looks like the longer and more explicit and regex is the faster it will complete. A greedy regex can be exponentially slower.
I thought I would test this out by measuring the time it takes to complete a more complex/explicit statement with a less complex/greedy statement. For the most part everything seems to hold true, but I have one greedy statement that clocked in slower. Here are two examples:
import re
from timeit import timeit
# This works as expected, the explicit is faster than the greedy.
# http_x_real_ip explicit
print(timeit(setup="import re", stmt='''r = re.search(r'(\d{1,3}\.\d{1,3}.\d{1,3}.\d{1,3})', '192.168.1.1 999.999.999.999')''', number=1000000))
1.159849308001867
# http_x_real_ip greedy
print(timeit(setup="import re", stmt='''r = re.search(r'((?:\d{1,3}\.){3}\d{1,3})', '192.168.1.1 999.999.999.999')''', number=1000000))
1.7421739230003368
# This does not work as expected, greedy is faster.
# time_local explicit
print(timeit(setup="import re", stmt='''r = re.search(r'(\d{1,2}/\w{3}/[2][0]\d{2}:\d{2}:\d{2}:\d{2}\s[+][0]{4})', "[23/Jun/2015:11:10:57 +0000]")''', number=1000000))
1.248802040994633
# time_local greedy
print(timeit(setup="import re", stmt='''r = re.search(r'\[(.*)\]', "[23/Jun/2015:11:10:57 +0000]")''', number=1000000))
1.0256699790043058
Is the local_time explict regex just poorly written?
Expose Literal Characters Regex engines match fastest when anchors and literal characters are right there in the main pattern, rather than buried in sub-expressions. Hence the advice to "expose" literal characters whenever you can take them out of an alternation or quantified expression. Let's look at two examples.
Being more specific with your regular expressions, even if they become much longer, can make a world of difference in performance. The fewer characters you scan to determine the match, the faster your regexes will be.
Yet matching a string with a regex can be surprisingly slow. So slow it can even stop any JS app or take 100% of a server CPU time causing denial of service (DOS). At TextMaster we used regular expressions in an important part of our translation platform.
Regex has an interpreted mode and a compiled mode. The compiled mode takes longer to start, but is generally faster.
The more a regular expression has to backtrack, the slower it is.
This might not hold for very small input data. However, who would care about the performance on small data? :D
This topic is well covered in this article:
Also there are interesting contributions in this question:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With