Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Slower search when start character is given is counterintuitive

I've written a Python utility to scan log files for known error patterns.

I was trying to speed up the search by providing the regex engine with additional pattern info. For example, not only that I'm looking for lines with gold, I require that such line must start with an underscore, so: ^_.*gold instead of gold.

As 99% of the lines do not start with an underscore I was expecting a big performance payoff because the regex engine could abort reading the line after just one character. I was surprised to learn the other way.

The following program illustrated the problem:

import re
from time import time
def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"^_") # requires  underscore as a first char
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)

    for p in patterns:
        start = time()
        for i in xrange(1000*1000):
            match = re.search(p, line)
        end = time() 
        print 'Elapsed: ' + str(end-start) 
main()

I've tried reviewing sre_compile.py looking for an explanation, but its code was too hairy for me.

Could the observed performance be explained by that that the inclusion of the start of line character turns the regex engine operation from a simple substring op to a much more complicated backtracking state machine op? Thereby outweighing any benefits like aborting the search after the first character?

Thinking so, I tried multiplying the line's length by x8, expecting the start of line search to shine, but instead the gap only grow wider (22sec Vs 6sec).

I'm puzzled :o am I missing something here?

like image 457
Gili Nachum Avatar asked Sep 18 '10 11:09

Gili Nachum


4 Answers

You're actually doing two things wrong: If you want to look at the beginning of the string use match not search. Also, don't use re.match( pattern, line), compile the pattern and use pattern.match(line).

import re
from time import time
def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"_") # requires  underscore as a first char
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)

    for p in patterns:
        start = time()
        for i in xrange(1000*1000):
            match = p.match(line)
        end = time() 
        print 'Elapsed: ' + str(end-start) 
main()

You will see that you have the expected behavior now - both pattern take exactly the same time.

like image 129
Jochen Ritzel Avatar answered Oct 19 '22 20:10

Jochen Ritzel


How about

if line[0] == "_" and "gold" in line:
   print "Yup, it starts with an underscore"
else:
   print "Nope it doesn't"

Seriously, don't overuse regex

like image 39
Robus Avatar answered Oct 19 '22 20:10

Robus


Interesting observation! I've played a bit with it. My guess would be that the regexp engine will scan the entire string for an underscore, and matches this against a line beginning once a match has been found. Perhaps this has to do with uniform behaviour when using re.MULTILINE

If you use re.match in stead of re.search for the underscore pattern, both seem to be equally fast, i.e.

def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"_.*") # requires  underscore as a first char
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)

    start = time()
    for i in xrange(1000*1000):
        match = re.match(p1, line)
    end = time() 
    print 'Elapsed: ' + str(end-start) 
    start = time()
    for i in xrange(1000*1000):
        match = re.search(p2, line)
    end = time() 
    print 'Elapsed: ' + str(end-start) 

In this case, match will require a match to start matching at the beginning of the string.

Also, be aware that the following use of precompiled patterns seems to be faster:

for p in patterns:
    start = time()
    for i in xrange(1000*1000):
        match = p.search(line)
    end = time() 
    print 'Elapsed: ' + str(end-start)

But the speed difference remains...

like image 43
Ivo van der Wijk Avatar answered Oct 19 '22 21:10

Ivo van der Wijk


Regexes don't always behave as you'd expect. I don't understand the internals, so I can't precisely explain the behavior. One thing to note, if you change from search to match, the patterns switch which is faster (though that isn't exactly what you want).

You're doing the right thing: measure and use the technique that is empirically faster.

like image 1
Ned Batchelder Avatar answered Oct 19 '22 20:10

Ned Batchelder