Optimizing string replace in Python

I have a simple problem. I have a number of text files where words have been divided (hyphenated) at the end of lines. Something like this:

toward an emotionless evalu-
ation of objectively gained

I would like to get rid of the hyphenation and join the words again. This could be done simply and quickly using the replace() function, as sketched after the next example. However, in some cases there are a few extra line breaks after the hyphen. Like this:

end up as a first rate con-


tribution, but that was not
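For reference, the simple case really is a one-liner with str.replace; it is the variable number of line breaks that forces the calls to pile up. A minimal sketch, assuming hyphens occur only at line ends:

# the simple case: exactly one newline after the hyphen
joined = text.replace('-\n', '')

# runs of blank lines need one call per run length, longest first,
# and any run longer than those covered still leaves stray newlines behind
joined = text.replace('-\n\n\n', '').replace('-\n\n', '').replace('-\n', '')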

Rather than pile up several calls to replace(), I've just switched to regular expressions and used re.sub(r'-\n+', '', text):

import re

def replace_hyphens(text):
    # remove a hyphen followed by one or more newlines
    return re.sub(r'-\n+', '', text)
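Applied to the example above, it joins the word across the blank lines:

text = "end up as a first rate con-\n\n\ntribution, but that was not"
print(replace_hyphens(text))  # end up as a first rate contribution, but that was not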

This works quite well, but I was wondering how I could achieve the same result with a function coded directly in Python. This is what I came up with:

def join_hyphens(text):
    processed = ''
    i = 0
    while i < len(text):
        # drop a hyphen only when newlines follow, to match the regex
        if text[i] == '-' and i + 1 < len(text) and text[i + 1] == '\n':
            i += 1  # skip the hyphen
            while i < len(text) and text[i] == '\n':
                i += 1  # skip the newlines
            continue
        processed += text[i]
        i += 1
    return processed

But of course the performance is abysmal compared to regular expressions. If I time it over 100 iterations on a rather long string, here are the results:

join_hyphens done in 2.398ms
replace_hyphens done in 0.021ms

What would be the best way to improve performance while still using native Python code?


Edit: Switching to a list, as suggested, significantly improves performance but still fares poorly compared to regular expressions:

def join_hyphens(text):
    processed = []
    i = 0
    while i < len(text):
        # drop a hyphen only when newlines follow, to match the regex
        if text[i] == '-' and i + 1 < len(text) and text[i + 1] == '\n':
            i += 1  # skip the hyphen
            while i < len(text) and text[i] == '\n':
                i += 1  # skip the newlines
            continue
        processed.append(text[i])
        i += 1
    return ''.join(processed)

Gives:

join_hyphens done in 1.769ms
replace_hyphens done in 0.020ms
asked Apr 04 '18 by user2969402


1 Answer

A bit late to the party, but no matter... Everything in the Python Standard Library counts as native Python, since it is available on any Python installation, so that also includes the re module.

However, if you insist on doing it in pure Python, instead of iterating over the characters one by one you can use native text search (str.find) to jump straight to each hyphen and copy whole chunks of text between matches. That should improve performance quite a bit and in certain cases can even beat the regex. As others have stated, building the result with "".join() is also much preferable to repeated string concatenation:

def join_hyphens(text):
    pieces = []  # chunk buffer
    head = 0  # current search position
    finder = text.find  # cache the lookup for str.find
    add_piece = pieces.append  # cache the lookup for list.append
    while True:
        index = finder("-\n", head)  # find the next hyphen + newline
        if index >= 0:  # a match was found
            add_piece(text[head:index])  # copy everything up to the hyphen
            head = index + 2  # move past the hyphen and the first newline
            while head < len(text) and text[head] == "\n":  # skip extra newlines
                head += 1
        else:
            add_piece(text[head:])  # copy the final chunk
            break
    return "".join(pieces)  # join the chunks and return them

And to test it:

text = """end up as a first rate con-


tribution, but that was not"""

print(join_hyphens(text))  # end up as a first rate contribution, but that was not
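To check the performance claim, a rough timeit comparison can be run against the regex version from the question. This is only a sketch: the corpus below is made up by repeating the sample, and exact numbers will vary by machine and input.

import timeit

# hypothetical corpus: repeat the sample to get a reasonably long string
corpus = "end up as a first rate con-\n\n\ntribution, but that was not " * 1000

print(timeit.timeit(lambda: join_hyphens(corpus), number=100))
print(timeit.timeit(lambda: replace_hyphens(corpus), number=100))  # regex version from the question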
answered Oct 19 '22 by zwer