
Fastest Python method for search and replace on a large string

Tags: python, regex

I'm looking for the fastest way to replace a large number of sub-strings inside a very large string. Here are two examples I've used.

findall() feels simpler and more elegant, but it takes an astounding amount of time.

finditer() blazes through a large file, but I'm not sure this is the right way to do it.

Here's some sample code. Note that the actual text I'm interested in is a single string around 10 MB in size, and there's a huge performance difference between these two methods.

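import re

def findall_replace(text, reg, rep):
    # Start from the input so the function also works when nothing matches.
    output = text
    # Every str.replace call rescans (and may recopy) the whole string.
    for match in reg.findall(text):
        output = output.replace(match, rep)
    return output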

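def finditer_replace(text, reg, rep):
    cursor_pos = 0
    pieces = []
    for match in reg.finditer(text):
        # Keep the text before group 1 of this match, then the replacement.
        pieces.append(text[cursor_pos:match.start(1)])
        pieces.append(rep)
        cursor_pos = match.end(1)
    # Append whatever follows the last match and build the result once.
    pieces.append(text[cursor_pos:])
    return ''.join(pieces)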

reg = re.compile(r'(dog)')
rep = 'cat'
text = 'dog cat dog cat dog cat'

finditer_replace(text, reg, rep)

findall_replace(text, reg, rep)

UPDATE: Added an re.sub() method to the tests:

def sub_replace(text, reg, rep):
    # re.sub() accepts a compiled pattern as its first argument.
    return re.sub(reg, rep, text)

Results

re.sub() - 0:00:00.031000
finditer() - 0:00:00.109000
findall() - 0:01:17.260000

Asked Feb 04 '11 at 00:02 by cyrus




1 Answer

The standard method is to use re.sub() from the standard library:

re.sub(reg, rep, text)
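As a minimal sketch of that call, using the pattern and sample text from the question: re.sub() accepts either a pattern string or a compiled pattern, and the compiled pattern's own sub() method gives the same result. The `replacements` mapping and the alternation pattern built from it are hypothetical additions, shown only to suggest how several different sub-strings might be replaced in a single pass.

```python
import re

reg = re.compile(r'(dog)')
text = 'dog cat dog cat dog cat'

# The module-level function and the compiled pattern's method are equivalent.
result = re.sub(reg, 'cat', text)
assert result == reg.sub('cat', text)
print(result)  # cat cat cat cat cat cat

# Hypothetical mapping of several sub-strings to their replacements.
replacements = {'dog': 'cat', 'cat': 'bird'}

# Build one alternation; longer keys first so they win over shorter prefixes.
keys = sorted(replacements, key=len, reverse=True)
multi = re.compile('|'.join(re.escape(key) for key in keys))

# The callable looks up each matched sub-string, so the text is scanned once
# and already-replaced text is never matched again.
swapped = multi.sub(lambda m: replacements[m.group(0)], text)
print(swapped)  # cat bird cat bird cat bird
```

Passing a function as the replacement keeps everything in one pass over the text, which matters when the input is around 10 MB.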

Incidentally, the reason for the performance difference between your versions is that each replacement in your first version (the findall() loop) causes the entire string to be recopied. A single copy is fast, but when you're copying 10 MB at a time, enough copies add up and become slow.
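The cost described above can be illustrated with a rough timing sketch. The function names, the synthetic text, and the repeat count are assumptions chosen for illustration. Absolute timings will vary by machine, so no expected numbers are shown.

```python
import re
import timeit

def single_pass(text, reg, rep):
    # One scan of the text; the output string is built once.
    return reg.sub(rep, text)

def replace_per_match(text, reg, rep):
    # One full str.replace pass (scan plus copy) for every match found,
    # so the work grows with (number of matches) x (length of text).
    output = text
    for match in reg.findall(text):
        output = text.replace(match, rep)
    return output

reg = re.compile(r'(dog)')
# Hypothetical test data: 2,000 matches in 16,000 characters.
text = 'dog cat ' * 2000

# Both approaches produce the same result on this input.
assert single_pass(text, reg, 'cat') == replace_per_match(text, reg, 'cat')

for func in (single_pass, replace_per_match):
    seconds = timeit.timeit(lambda: func(text, reg, 'cat'), number=3)
    print(f'{func.__name__}: {seconds:.4f}s for 3 runs')
```

Increasing the multiplier on the test string should widen the gap, since the per-match version does more passes over a longer string.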

Answered Sep 16 '22 at 21:09 by btilly