I have a function that returns True if a string matches at least one regular expression in a list and False otherwise. The function is called often enough that performance is an issue.
When running it through cProfile, the function is spending about 65% of its time doing matches and 35% of its time iterating over the list.
I would think there would be a way to use map() or something but I can't think of a way to have it stop iterating after it finds a match.
Is there a way to make the function faster while still having it return upon finding the first match?
def matches_pattern(str, patterns):
    for pattern in patterns:
        if pattern.match(str):
            return True
    return False
The first thing that comes to mind is pushing the loop to the C side by using a generator expression:
def matches_pattern(s, patterns):
    return any(p.match(s) for p in patterns)
Probably you don't even need a separate function for that.
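For example, if the compiled patterns are already in scope, the check reads fine inline at the call site (the variable names here are just illustrative):

if any(p.match(s) for p in compiled_patterns):
    ...  # handle the match here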
Another thing you should try out is to build a single, composite regex using the | alternation operator, so that the engine has a chance to optimize it for you. You can also create the regex dynamically from a list of string patterns, if this is necessary:
def matches_pattern(s, patterns):
    return re.match('|'.join('(?:%s)' % p for p in patterns), s)
Of course you need to have your regexes in string form for that to work. Just profile both of these and check which one is faster :)
You might also want to have a look at a general tip for debugging regular expressions in Python. This can also help to find opportunities to optimize.
UPDATE: I was curious and wrote a little benchmark:
import timeit

setup = """
import re

patterns = [".*abc", "123.*", "ab.*", "foo.*bar", "11010.*", "1[^o]*"]*10
strings = ["asdabc", "123awd2", "abasdae23", "fooasdabar", "111", "11010100101", "xxxx", "eeeeee", "dddddddddddddd", "ffffff"]*10
compiled_patterns = list(map(re.compile, patterns))

def matches_pattern(str, patterns):
    for pattern in patterns:
        if pattern.match(str):
            return True
    return False

def test0():
    for s in strings:
        matches_pattern(s, compiled_patterns)

def test1():
    for s in strings:
        any(p.match(s) for p in compiled_patterns)

def test2():
    for s in strings:
        re.match('|'.join('(?:%s)' % p for p in patterns), s)

def test3():
    r = re.compile('|'.join('(?:%s)' % p for p in patterns))
    for s in strings:
        r.match(s)
"""

print(timeit.timeit("test0()", setup=setup, number=1000))
print(timeit.timeit("test1()", setup=setup, number=1000))
print(timeit.timeit("test2()", setup=setup, number=1000))
print(timeit.timeit("test3()", setup=setup, number=1000))
The output on my machine:
1.4120500087738037
1.662621021270752
4.729579925537109
0.1489570140838623
So any() doesn't seem to be faster than your original approach. Building up the regex dynamically on every call isn't really fast either. But if you can manage to build the combined regex upfront and reuse it several times (as test3 does), this results in much better performance. You can also adapt this benchmark to test some other options :)
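For completeness, here is a minimal sketch of that precompiled variant outside the benchmark harness (the pattern strings are just the ones from the setup above; substitute your own list):

import re

# Combine the pattern strings once, at setup time, into a single regex.
patterns = [".*abc", "123.*", "ab.*", "foo.*bar", "11010.*", "1[^o]*"]
combined = re.compile('|'.join('(?:%s)' % p for p in patterns))

def matches_pattern(s):
    # One match call against the precompiled, combined regex.
    return bool(combined.match(s))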
The fastest way to do this is to combine all the regexes into one with "|" between them, then make a single match call. You'll also want to compile the combined regex once, so you're not paying for repeated regex compilation.
For example:
def matches_pattern(s, pats):
    pat = "|".join("(%s)" % p for p in pats)
    return bool(re.match(pat, s))
This is for pats as strings, not compiled patterns. If you really only have compiled regexes, then:
def matches_pattern(s, pats):
    pat = "|".join("(%s)" % p.pattern for p in pats)
    return bool(re.match(pat, s))
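If the list of patterns doesn't change between calls, you can also do the join and the compile just once and reuse the result. A sketch, assuming pats holds compiled patterns as above (make_matcher is just an illustrative helper name):

import re

def make_matcher(pats):
    # Combine the source strings of the compiled patterns and compile once.
    combined = re.compile("|".join("(%s)" % p.pattern for p in pats))
    # Return a reusable predicate that performs a single match call.
    return lambda s: bool(combined.match(s))

matches = make_matcher(pats)  # build once
matches(s)                    # call many times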