
Python: speed up this regex sub

Tags: python, regex

import re

p = re.compile(r'>.*\n')
text = p.sub('', text)

I want to delete all lines starting with a '>'. I have a really huge file (3GB) that I process in chunks of 250MB, so the variable "text" is a string of about 250MB. (I tried different chunk sizes, but the performance over the complete file was always the same.)

Now, can I speed up this regex somehow? I tried the multi-line matching, but it was a lot slower. Or are there even better ways?

I already tried splitting the string and filtering out the lines like this, but it was also slower. (I also tried a lambda instead of def del_line; the code below might not run as written, it's just from memory.)

def del_line(x):
    # keep only lines that do not start with '>'
    return x[0] != '>'

def func():
    ....
    text = file.readlines(chunksize)
    text = filter(del_line, text)
    ...

EDIT: As suggested in the comments, I also tried walking line by line:

text = []
for line in file:
    if line[0] != '>':
        text.append(line)
text = ''.join(text)

That's also slower: it needs ~12 sec, while my regex needs ~7 sec. (Yes, that's already fast, but it also has to run on slower machines.)

EDIT: Of course, I also tried str.startswith('>'), it was slower...

asked May 09 '14 by Eulelie

1 Answer

If you have the chance, running grep as a subprocess is probably the most pragmatic choice.
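For instance, a minimal sketch of that idea (the file names are placeholders, and it assumes a grep binary is available on the PATH; grep -v '^>' prints every line that does not start with '>'):

import subprocess

# Let grep do the filtering and stream its output straight to a new file.
# 'input.txt' and 'filtered.txt' are placeholder names.
with open('input.txt', 'rb') as src, open('filtered.txt', 'wb') as dst:
    subprocess.run(['grep', '-v', '^>'], stdin=src, stdout=dst)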

If for whatever reason you can't rely on grep, you could try implementing some of the "tricks" that make grep fast. From the author himself, you can read about them here: http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html

At the end of the article, the author summarizes the main points. The one that stands out to me the most is:

Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking for newlines would slow grep down by a factor of several times, because to find the newlines it would have to look at every byte!

The idea would be to load the entire file into memory and iterate over it at the byte level instead of the line level. Only when you find a match do you look for the line boundaries and delete that line.
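A rough sketch of that byte-level idea (the function name and the handling of a final line without a trailing newline are my own assumptions, not code from the article): instead of splitting the buffer into lines, it searches for the b'\n>' marker and only then locates the end of the offending line.

def drop_gt_lines(data: bytes) -> bytes:
    """Remove every line starting with '>' without splitting the buffer into lines."""
    out = []
    pos = 0
    # Special case: a '>' line at the very start of the buffer.
    if data.startswith(b'>'):
        end = data.find(b'\n')
        pos = len(data) if end == -1 else end + 1
    while True:
        hit = data.find(b'\n>', pos)        # search for the marker, not for every newline
        if hit == -1:
            out.append(data[pos:])          # keep the rest of the buffer
            break
        out.append(data[pos:hit + 1])       # keep everything up to and including that '\n'
        end = data.find(b'\n', hit + 2)     # end of the '>' line we want to drop
        pos = len(data) if end == -1 else end + 1
    return b''.join(out)

Whether this beats the compiled regex under CPython is not guaranteed; it mainly illustrates the "don't break the input into lines" point and would need to be benchmarked against the re.sub version on your data.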

You say you have to run this on other computers. If it's within your reach and you are not doing it already, consider running it on PyPy instead of CPython (the default interpreter). This may (or may not) improve the runtime by a significant factor, depending on the nature of the program.

Also, as some comments already mentioned, benchmark with the actual grep to get a baseline of how fast you can go, reasonably speaking. Get it on Cygwin if you are on Windows, it's easy enough.

answered Oct 06 '22 by Rafael Almeida