Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

re.match vs re.search performance difference

Tags:

python

regex

I tried to compare re.match and re.search using timeit module and I found that match was better than search when the string I want to found was at the beginning of the string.

>>> s1 = '''
... import re
... re.search(r'hello','helloab'*100000)
... '''
>>> timeit.timeit(stmt=s1,number=10000)
32.12064480781555


>>> s = '''
... import re
... re.match(r'hello','helloab'*100000)
... '''
>>> timeit.timeit(stmt=s,number=10000)
30.9136700630188

Now, I am aware that match looks for the pattern in the beginning of the string and return an object if found but what I am wondering is how does search operate.

Does search performs any extra matching after the string is found in the beginning which slows it down?

Update

After using @David Robinsons code I got results simlar to him.

>>> print timeit.timeit(stmt="r.match('hello')",
...              setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
...              number = 10000000)
49.9567620754
>>> print timeit.timeit(stmt="r.search('hello')",
...              setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
...             number = 10000000)
35.6694438457

So, the updated question is now why search is out-performing match?

like image 203
RanRag Avatar asked Oct 09 '12 15:10

RanRag


People also ask

What is the difference between re match () and re search ():?

There is a difference between the use of both functions. Both return the first match of a substring found in the string, but re. match() searches only from the beginning of the string and return match object if found. But if a match of substring is found somewhere in the middle of the string, it returns none.

What is the difference between match and search in regular expression?

Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).

Is RegEx faster than find?

Just so you know, the main thing slowing down the regexp here is having to compile the pattern every time. If the pattern is precompiled and used over and over again, matching is only about 25% slower than using find.

What does the function rematch do?

What does the function re. match do? Explanation: It will look for the pattern at the beginning and return None if it isn't found.


2 Answers

On my machine (Python 2.7.3 on Mac OS 10.7.3, 1.7 GHz Intel Core i5), when done putting the string construction, importing re, and the regex compiling in setup, and performing 10000000 iterations instead of 10, I find the opposite:

import timeit

print timeit.timeit(stmt="r.match(s)",
             setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
             number = 10000000)
# 6.43165612221
print timeit.timeit(stmt="r.search(s)",
             setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
            number = 10000000)
# 3.85176897049
like image 24
David Robinson Avatar answered Sep 28 '22 23:09

David Robinson


"So, the updated question is now why search is out-performing match?"

In this particular instance where a literal string is used rather than a regex pattern, indeed re.search is slightly faster than re.match for the default CPython implementation (I have not tested this in other incarnations of Python).

>>> print timeit.timeit(stmt="r.match(s)",
...              setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
...              number = 10000000)
3.29107403755
>>> print timeit.timeit(stmt="r.search(s)",
...              setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
...             number = 10000000)
2.39184308052

Looking into the C code behind those modules, it appears the search code has a built in optimisation to quickly match patterns prefixed with a string lateral. In the example above, the whole pattern is a literal string with no regex patterns and so this optimised routined is used to match the whole pattern.

Notice how the performance degrades once we introduce regex symbols and, as the literal string prefix gets shorter:

>>> print timeit.timeit(stmt="r.search(s)",
...              setup="import re; s = 'helloab'*100000; r = re.compile('hell.')",
...             number = 10000000)

3.20765399933
>>> 
>>> print timeit.timeit(stmt="r.search(s)",
...              setup="import re; s = 'helloab'*100000; r = re.compile('hel.o')",
...             number = 10000000)
3.31512498856
>>> print timeit.timeit(stmt="r.search(s)",
...              setup="import re; s = 'helloab'*100000; r = re.compile('he.lo')",
...             number = 10000000)
3.31983995438
>>> print timeit.timeit(stmt="r.search(s)",
...              setup="import re; s = 'helloab'*100000; r = re.compile('h.llo')",
...             number = 10000000)
3.39261603355

For portion of the pattern that contain regex patterns, SRE_MATCH is used to determine matches. That is essentially the same code behind re.match.

Note how the results are close (with re.match marginally faster) if the pattern starts with a regex pattern instead of a literal string.

>>> print timeit.timeit(stmt="r.match(s)",
...              setup="import re; s = 'helloab'*100000; r = re.compile('.ello')",
...              number = 10000000)
3.22782492638
>>> print timeit.timeit(stmt="r.search(s)",
...              setup="import re; s = 'helloab'*100000; r = re.compile('.ello')",
...             number = 10000000)
3.31773591042

In other words, ignoring the fact that search and match have different purposes, re.search is faster than re.match only when the pattern is a literal string.

Of course, if you're working with literal strings, you're likely to be better off using string operations instead.

>>> # Detecting exact matches
>>> print timeit.timeit(stmt="s == r", 
...              setup="s = 'helloab'*100000; r = 'hello'", 
...              number = 10000000)
0.339027881622

>>> # Determine if string contains another string
>>> print timeit.timeit(stmt="s in r", 
...              setup="s = 'helloab'*100000; r = 'hello'", 
...              number = 10000000)
0.479326963425


>>> # detecting prefix
>>> print timeit.timeit(stmt="s.startswith(r)",
...              setup="s = 'helloab'*100000; r = 'hello'",
...             number = 10000000)
1.49393510818
>>> print timeit.timeit(stmt="s[:len(r)] == r",
...              setup="s = 'helloab'*100000; r = 'hello'",
...             number = 10000000)
1.21005606651
like image 180
Shawn Chin Avatar answered Sep 29 '22 01:09

Shawn Chin