Python, fastest way to iterate over regular expressions but stop on first match

Tags:

I have a function that returns True if a string matches at least one regular expression in a list and False otherwise. The function is called often enough that performance is an issue.

When running it through cProfile, the function is spending about 65% of its time doing matches and 35% of its time iterating over the list.

I would think there would be a way to use map() or something but I can't think of a way to have it stop iterating after it finds a match.

Is there a way to make the function faster while still having it return upon finding the first match?

def matches_pattern(str, patterns):
    for pattern in patterns:
        if pattern.match(str):
            return True
    return False

630

asked May 14 '12 21:05

puffenstuff

2 Answers

The first thing that comes to mind is pushing the loop to the C side by using a generator expression:

def matches_pattern(s, patterns):
    return any(p.match(s) for p in patterns)

Probably you don't even need a separate function for that.

Another thing you should try out is to build a single, composite regex using the | alternation operator, so that the engine has a chance to optimize it for you. You can also create the regex dynamically from a list of string patterns, if this is necessary:

def matches_pattern(s, patterns):
    return re.match('|'.join('(?:%s)' % p for p in patterns), s)

Of course you need to have your regexes in string form for that to work. Just profile both of these and check which one is faster :)

You might also want to have a look at a general tip for debugging regular expressions in Python. This can also help to find opportunities to optimize.

UPDATE: I was curious and wrote a little benchmark:

import timeit

setup = """
import re
patterns = [".*abc", "123.*", "ab.*", "foo.*bar", "11010.*", "1[^o]*"]*10
strings = ["asdabc", "123awd2", "abasdae23", "fooasdabar", "111", "11010100101", "xxxx", "eeeeee", "dddddddddddddd", "ffffff"]*10
compiled_patterns = list(map(re.compile, patterns))

def matches_pattern(str, patterns):
    for pattern in patterns:
        if pattern.match(str):
            return True
    return False

def test0():
    for s in strings:
        matches_pattern(s, compiled_patterns)

def test1():
    for s in strings:
        any(p.match(s) for p in compiled_patterns)

def test2():
    for s in strings:
        re.match('|'.join('(?:%s)' % p for p in patterns), s)

def test3():
    r = re.compile('|'.join('(?:%s)' % p for p in patterns))
    for s in strings:
        r.match(s)
"""

import sys
print(timeit.timeit("test0()", setup=setup, number=1000))
print(timeit.timeit("test1()", setup=setup, number=1000))
print(timeit.timeit("test2()", setup=setup, number=1000))
print(timeit.timeit("test3()", setup=setup, number=1000))

The output on my machine:

1.4120500087738037
1.662621021270752
4.729579925537109
0.1489570140838623

So any doesn't seem to be faster than your original approach. Building up a regex dynamically also isn't really fast. But if you can manage to build up a regex upfront and use it several times, this might result in better performance. You can also adapt this benchmark to test some other options :)

154

answered Oct 06 '22 16:10

Niklas B.

The way to do this fastest is to combine all the regexes into one with "|" between them, then make one regex match call. Also, you'll want to compile it once to be sure you're avoiding repeated regex compilation.

For example:

def matches_pattern(s, pats):
    pat = "|".join("(%s)" % p for p in pats)
    return bool(re.match(pat, s))

This is for pats as strings, not compiled patterns. If you really only have compiled regexes, then:

def matches_pattern(s, pats):
    pat = "|".join("(%s)" % p.pattern for p in pats)
    return bool(re.match(pat, s))

answered Oct 06 '22 14:10

Ned Batchelder

Related questions
                            
                                What is this cProfile result telling me I need to fix?
                            
                                .htaccess mod_rewrite performance
                            
                                Why is numpy much slower than matlab on a digitize example?
                            
                                Why does adding a polymorphic type signature degrade performance?
                            
                                Creating a numpy array of 3D coordinates from three 1D arrays
                            
                                Why does using arguments make this function so much slower?
                            
                                Efficient way to take the minimum/maximum n values and indices from a matrix using NumPy
                            
                                CouchDB view is extremely slow
                            
                                slow rails stack
                            
                                Optimal Buffer size for read-process-write
                            
                                Very slow DELETE query
                            
                                testing tic tac toe win condition [duplicate]
                            
                                Compiler optimization: g++ slower than intel
                            
                                Some Oreo devices are not getting Push Notification
                            
                                python string join performance
                            
                                Why is matrix multiplication in .NET so slow?
                            
                                Any way to speed up Python and Pygame?
                            
                                r: for loop operation with nested indices runs super slow
                            
                                In R, why is sum so slow compared to others, such as cumsum?
                            
                                Does passing a value type in an "out" parameter cause the variable to be boxed?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Python, fastest way to iterate over regular expressions but stop on first match

Tags:

performance

python

python-3.x

puffenstuff

People also ask

2 Answers

Niklas B.

Ned Batchelder

Recent Activity

Donate For Us