Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python grep code much slower than command line's grep

I'm just grepping some Xliff files for the pattern approved="no". I have a Shell script and a Python script, and the difference in performance is huge (for a set of 393 files, and a total of 3,686,329 lines, 0.1s user time for the Shell script, and 6.6s for the Python script).

Shell: grep 'approved="no"' FILE
Python:

def grep(pattern, file_path):
    ret = False

    with codecs.open(file_path, "r", encoding="utf-8") as f:
        while 1 and not ret:
            lines = f.readlines(100000)
            if not lines:
                break
            for line in lines:
                if re.search(pattern, line):
                    ret = True
                    break
    return ret

Any ideas to improve performance with a multiplatform solution?

Results

Here are a couple of results after applying some of the proposed solutions.
Tests were run on a RHEL6 Linux machine, with Python 2.6.6.
Working set: 393 Xliff files, 3,686,329 lines in total.
Numbers are user time in seconds.

grep_1 (io, joining 100,000 file lines): 50s
grep_3 (mmap): 0.7s
Shell version (Linux grep): 0.130s

like image 732
rturrado Avatar asked Aug 12 '16 11:08

rturrado


2 Answers

Python, being an interpreted language vs. a compiled C version of grep will always be slower.

Apart from that your Python implementation is not the same as your grep example. It is not returning the matching lines, it is merely testing to see if the pattern matches the characters on any one line. A closer comparison would be:

grep -q 'approved="no"' FILE

which will return as soon as a match is found and not produce any output.

You can substantially speed up your code by writing your grep() function more efficiently:

def grep_1(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        while True:
            lines = f.readlines(100000)
            if not lines:
                return False
            if re.search(pattern, ''.join(lines)):
                return True

This uses io instead of codecs which I found was a little faster. The while loop condition does not need to check ret and you can return from the function as soon as the result is known. There's no need to run re.search() for each individual ilne - just join the lines and perform a single search.

At the cost of memory usage you could try this:

import io

def grep_2(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        return re.search(pattern, f.read())

If memory is an issue you could mmap the file and run the regex search on the mmap:

import io
import mmap

def grep_3(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        return re.search(pattern, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))

mmap will efficiently read the data from the file in pages without consuming a lot of memory. Also, you'll probably find that mmap runs faster than the other solutions.


Using timeit for each of these functions shows that this is the case:

10 loops, best of 3: 639 msec per loop       # grep()
10 loops, best of 3: 78.7 msec per loop      # grep_1()
10 loops, best of 3: 19.4 msec per loop      # grep_2()
100 loops, best of 3: 5.32 msec per loop     # grep_3()

The file was /usr/share/dict/words containing approx 480,000 lines and the search pattern was zymurgies, which occurs near the end of the file. For comparison, when pattern is near the start of the file, e.g. abaciscus, the times are:

10 loops, best of 3: 62.6 msec per loop       # grep()
1000 loops, best of 3: 1.6 msec per loop      # grep_1()
100 loops, best of 3: 14.2 msec per loop      # grep_2()
10000 loops, best of 3: 37.2 usec per loop    # grep_3()

which again shows that the mmap version is fastest.


Now comparing the grep command with the Python mmap version:

$ time grep -q zymurgies /usr/share/dict/words

real    0m0.010s
user    0m0.007s
sys 0m0.003s

$ time python x.py grep_3    # uses mmap

real    0m0.023s
user    0m0.019s
sys 0m0.004s

Which is not too bad considering the advantages that grep has.

like image 122
mhawke Avatar answered Oct 21 '22 21:10

mhawke


Grep is actually a very clever piece of software, it does not just do a regex search per line. It utilizes the Boyer-Moore algorithm. See here for more information.

See here for a python implementation of grep for more pointers.

like image 37
mzhaase Avatar answered Oct 21 '22 20:10

mzhaase