Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python, fastest way to iterate over regular expressions but stop on first match

I have a function that returns True if a string matches at least one regular expression in a list and False otherwise. The function is called often enough that performance is an issue.

When running it through cProfile, the function is spending about 65% of its time doing matches and 35% of its time iterating over the list.

I would think there would be a way to use map() or something but I can't think of a way to have it stop iterating after it finds a match.

Is there a way to make the function faster while still having it return upon finding the first match?

def matches_pattern(str, patterns):
    for pattern in patterns:
        if pattern.match(str):
            return True
    return False
like image 630
puffenstuff Avatar asked May 14 '12 21:05

puffenstuff


People also ask

How do you stop greedy in regex?

You make it non-greedy by using ". *?" When using the latter construct, the regex engine will, at every step it matches text into the "." attempt to match whatever make come after the ". *?" . This means that if for instance nothing comes after the ".

Is regex faster than for loop?

Regex is faster for large string than an if (perhaps in a for loops) to check if anything matches your requirement. If you are using regex as to match very small text and small pattern and don't do it because the matcher function . find() is slower than a normal if statement of a switch statement.

How do you match a space in regex?

\s stands for “whitespace character”. Again, which characters this actually includes, depends on the regex flavor. In all flavors discussed in this tutorial, it includes [ \t\r\n\f]. That is: \s matches a space, a tab, a carriage return, a line feed, or a form feed.

What is re method in python?

The Python "re" module provides regular expression support. In Python a regular expression search is typically written as: match = re. search(pat, str) The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string.


2 Answers

The first thing that comes to mind is pushing the loop to the C side by using a generator expression:

def matches_pattern(s, patterns):
    return any(p.match(s) for p in patterns)

Probably you don't even need a separate function for that.

Another thing you should try out is to build a single, composite regex using the | alternation operator, so that the engine has a chance to optimize it for you. You can also create the regex dynamically from a list of string patterns, if this is necessary:

def matches_pattern(s, patterns):
    return re.match('|'.join('(?:%s)' % p for p in patterns), s)

Of course you need to have your regexes in string form for that to work. Just profile both of these and check which one is faster :)

You might also want to have a look at a general tip for debugging regular expressions in Python. This can also help to find opportunities to optimize.

UPDATE: I was curious and wrote a little benchmark:

import timeit

setup = """
import re
patterns = [".*abc", "123.*", "ab.*", "foo.*bar", "11010.*", "1[^o]*"]*10
strings = ["asdabc", "123awd2", "abasdae23", "fooasdabar", "111", "11010100101", "xxxx", "eeeeee", "dddddddddddddd", "ffffff"]*10
compiled_patterns = list(map(re.compile, patterns))

def matches_pattern(str, patterns):
    for pattern in patterns:
        if pattern.match(str):
            return True
    return False

def test0():
    for s in strings:
        matches_pattern(s, compiled_patterns)

def test1():
    for s in strings:
        any(p.match(s) for p in compiled_patterns)

def test2():
    for s in strings:
        re.match('|'.join('(?:%s)' % p for p in patterns), s)

def test3():
    r = re.compile('|'.join('(?:%s)' % p for p in patterns))
    for s in strings:
        r.match(s)
"""

import sys
print(timeit.timeit("test0()", setup=setup, number=1000))
print(timeit.timeit("test1()", setup=setup, number=1000))
print(timeit.timeit("test2()", setup=setup, number=1000))
print(timeit.timeit("test3()", setup=setup, number=1000))

The output on my machine:

1.4120500087738037
1.662621021270752
4.729579925537109
0.1489570140838623

So any doesn't seem to be faster than your original approach. Building up a regex dynamically also isn't really fast. But if you can manage to build up a regex upfront and use it several times, this might result in better performance. You can also adapt this benchmark to test some other options :)

like image 154
Niklas B. Avatar answered Oct 06 '22 16:10

Niklas B.


The way to do this fastest is to combine all the regexes into one with "|" between them, then make one regex match call. Also, you'll want to compile it once to be sure you're avoiding repeated regex compilation.

For example:

def matches_pattern(s, pats):
    pat = "|".join("(%s)" % p for p in pats)
    return bool(re.match(pat, s))

This is for pats as strings, not compiled patterns. If you really only have compiled regexes, then:

def matches_pattern(s, pats):
    pat = "|".join("(%s)" % p.pattern for p in pats)
    return bool(re.match(pat, s))
like image 28
Ned Batchelder Avatar answered Oct 06 '22 14:10

Ned Batchelder