
Efficiently make many multiple substitutions in a string

How to make multiple substitutions in a string based on a dictionary has been addressed before. There seems to be a group of options based on string.replace and a group based on regular expressions, plus a couple of others.

But I am interested in the efficiency of the different methods depending on the size of the dictionary, which I found to have a very important impact.

my_subs = {'Hello world': 'apple', 'is': 'ship'}
string = 'Hello world! This is nice.'
new_string = my_efficient_method(string, my_subs)

To be clear, this question is not about how to make these substitutions, but about which method is more efficient in which conditions, and which caveats apply. In particular, I am looking for the most practical approach when there are many (>100k) strings that are long (10-20k characters) and the dictionary is huge (>80k pairs). Under these circumstances the preferred methods based on regular expressions perform very poorly.

asked Mar 03 '23 by Pablo
2 Answers

As stated before, there are different approaches, each with different advantages. I am using three different situations for comparison.

  1. Short dictionary (847 substitution pairs)
  2. Medium dictionary (2528 pairs)
  3. Long dictionary (80430 pairs)

For dictionaries 1 and 2 (the shorter ones) I repeat each method 50 times in a loop, to get more consistent timings. With the long one, a single pass over one document already takes long enough (sadly). I tested 1 and 2 using the online service TIO with Python 3.8; the long one was tested on my laptop with Python 3.6. Only the relative performance between methods is relevant, so the minor specifics are not important.

My string is between 28k and 29k characters.

All times given in seconds.


UPDATE: Flashtext

A colleague found Flashtext, a Python library that specializes in precisely this. It allows searching for keywords as well as applying substitutions. It is about two orders of magnitude faster than the other alternatives: in experiment 3 my previous best time was 1.8 seconds, while Flashtext takes 0.015 seconds.


Regular Expressions

There are many variations, but the best tend to be very similar to this:

import re
rep = dict((re.escape(k), v) for k, v in my_dict.items())
pattern = re.compile("|".join(rep.keys()))
new_string = pattern.sub(lambda m: rep[re.escape(m.group(0))], string)

Execution times were:

  1. 1.63
  2. 5.03
  3. 7.7
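One caveat worth noting (my addition, not part of the original benchmark): re tries the alternatives left to right, so a shorter key can shadow a longer key that starts at the same position. Sorting the keys by descending length avoids this:

```python
import re

my_dict = {'is': '1', 'island': '2'}
rep = {re.escape(k): v for k, v in my_dict.items()}
# Longest keys first, so 'island' is tried before its prefix 'is'
pattern = re.compile('|'.join(sorted(rep, key=len, reverse=True)))
new_string = pattern.sub(lambda m: rep[re.escape(m.group(0))], 'island is')
print(new_string)  # '2 1'
```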


Replace

This method simply applies string.replace in a loop. (Later I talk about problems with this.)

for original, replacement in my_dict.items():
    string = string.replace(original, replacement)

A variation uses functools.reduce, which applies a lambda expression iteratively. This is best understood with an example from the official documentation. The expression

reduce(lambda x, y: x+y, [1, 2, 3, 4, 5])

equals ((((1+2)+3)+4)+5)

import functools
new_string = functools.reduce(lambda a, k: a.replace(*k), 
                              my_dict.items(), string)

Python 3.8 allows assignment expressions, which enable a one-liner variant. At its core this also relies on string.replace.

[string := string.replace(f' {a} ', f' {b} ') for a, b in my_dict.items()]

Execution times were (in parentheses, results for the reduce and assignment-expression variants):

  1. 1.37 (1.39) (1.50)
  2. 4.10 (4.12) (4.07)
  3. 1.9 (1.8) (no Python 3.8 on that machine)


Recursive Lambda

Another proposal uses a recursive lambda.

# Note: popitem() empties the dictionary as it recurses; pass a copy if needed
mrep = lambda s, d: s if not d else mrep(s.replace(*d.popitem()), d)
new_string = mrep(string, my_dict)

Execution times were:

  1. 0.07
  2. RecursionError
  3. RecursionError


Practical remarks

See the update above: Flashtext is much faster than the other alternatives.

You can see from the execution times that the recursive approach is clearly the fastest, but it only works with small dictionaries. Increasing the recursion limit much is not recommended in Python, so this approach is ruled out for longer dictionaries.

Regular expressions offer more control over your substitutions. For example, you may use \b before or after an element to ensure that there are no word characters on that side of the target substring (to prevent {'a': '1'} from being applied inside 'apple'). The cost is that performance drops sharply for longer dictionaries, taking almost four times as long as the other options.
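For instance, a word-boundary variant of the regex pattern could look like this (a sketch with a toy dictionary):

```python
import re

my_dict = {'a': '1', 'is': '2'}
rep = {re.escape(k): v for k, v in my_dict.items()}
# \b on both sides: match only when no word character is adjacent
pattern = re.compile(r'\b(?:' + '|'.join(rep.keys()) + r')\b')
new_string = pattern.sub(lambda m: rep[re.escape(m.group(0))], 'a is in apple')
print(new_string)  # '1 2 in apple' -- 'apple' is left alone
```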

Assignment expressions, reduce and simply looping over replace offer similar performance (assignment expressions could not be tested with the long dictionary). Taking readability into account, string.replace seems like the best option. The problem with this, compared to regular expressions, is that substitutions happen sequentially, not in a single pass, so {'a': 'b', 'b': 'c'} returns 'c' for the string 'a'. Dictionaries preserve insertion order in Python 3.7+ (though you may want to keep using OrderedDict), so you can order the substitutions carefully to avoid problems. Of course, with 80k substitutions you cannot rely on this.
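The sequential behaviour is easy to see with a tiny sketch:

```python
subs = {'a': 'b', 'b': 'c'}  # insertion order decides the outcome

s1 = 'a'
for original, replacement in subs.items():
    s1 = s1.replace(original, replacement)   # 'a' -> 'b', then 'b' -> 'c'
print(s1)  # 'c'

s2 = 'a'
for original, replacement in reversed(list(subs.items())):
    s2 = s2.replace(original, replacement)   # 'b' -> 'c' first, then 'a' -> 'b'
print(s2)  # 'b'
```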

I am currently using a loop with replace, with some preprocessing to minimize trouble: I add spaces on both sides of punctuation (also in the dictionary, for items containing punctuation). Then I can search for substrings surrounded by spaces, and insert substitutions with spaces as well. This also works when the targets are multiple words:

string = 'This is: an island'
my_dict = {'is': 'is not', 'an island': 'a museum'}

After this preprocessing (done with replace or regular expressions) I get string = ' This is : an island ', so that my replace loop

for original, replacement in my_dict.items():
    string = string.replace(f' {original} ', f' {replacement} ')

returns ' This is not : a museum ' as intended. Note that the 'is' in 'This' and in 'island' was left alone. Regular expressions could be used to restore the punctuation afterwards, although I don't need that step.
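Putting the preprocessing and the loop together, a minimal sketch (the pad_punct helper and its punctuation set are my own illustration, not the exact code used in the benchmarks):

```python
import re

def pad_punct(s):
    # Hypothetical helper: space out punctuation, collapse whitespace runs,
    # and pad both ends so that every target is surrounded by spaces.
    s = re.sub(r'([!:,.?;])', r' \1 ', s)
    return ' ' + re.sub(r'\s+', ' ', s).strip() + ' '

my_dict = {'is': 'is not', 'an island': 'a museum'}
string = pad_punct('This is: an island')   # ' This is : an island '
for original, replacement in my_dict.items():
    string = string.replace(f' {original} ', f' {replacement} ')
print(string)  # ' This is not : a museum '
```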

answered Mar 17 '23 by Pablo

By compiling your replacement table into a finite state machine (FSM), you can finish the replacement in a single character-by-character pass. Using a pre-compiled FSM also boosts performance when you apply the same replacement table to many different strings.

Compared to other methods:

  • The FSM is similar to a trie: replacement values for non-leaf nodes are pre-calculated too, so you don't need to backtrack the trie, which saves some time.
  • The FSM is similar to a pre-compiled regex, since a regex is just a friendly form of an FSM.
  • The FSM is also similar to the replace method: searching for a substring with the KMP algorithm is essentially an FSM; here we just build multiple FSMs into a single one.

The time complexity of compile is O(mk²), where m is the number of entries in the replacement table and k is the length of a single replacement rule. The time complexity of replace is O(n + n'), where n is the length of the input and n' is the length of the output.

To clarify: in the code below, when multiple rules match the same characters, the one that begins earliest is applied; in case of a tie, the longest one is applied.

import typing

mismatch = ''

ConvertTree = typing.Dict[str, typing.Tuple[str, 'ConvertTree']]
def compile(rulePairs:typing.Iterable[typing.Tuple[str, str]]) -> ConvertTree:
    # Build Trie
    root: ConvertTree = {}
    for src, dst in rulePairs:
        current = root
        for ch in src:
            if ch not in current:
                node: ConvertTree = {}
                node[mismatch] = None
                current[ch] = ('', node)
            current = current[ch][1]
        if not current.get(mismatch, None):
            current[mismatch] = (dst, root)
        else:
            old_dst = current[mismatch][0]
            if dst != old_dst:
                raise KeyError(f"Found conflict rules:\n  * {src} -> {old_dst}\n  * {src} -> {dst}")

    # Fill non-leaf nodes
    node_with_suffix: typing.List[typing.Tuple[ConvertTree, str, str]] = [(root, '', '')]
    for node, output, suffix in node_with_suffix:
        node_output, node_suffix = output, suffix
        if node.get(mismatch, None):
            leaf = node[mismatch]
            node_output, node_suffix = leaf[0], ''
        for key in node:
            if key == mismatch: continue
            val = node[key]
            next_node = val[1]
            if next_node is root: continue
            if len(next_node) == 1:
                node[key] = next_node[mismatch]
            else:
                node_with_suffix.append((next_node, node_output, node_suffix + key))
        if not node_suffix: continue
        node_next = root
        for ch in node_suffix:
            while node_output != '' and ch not in node_next and node_next is not root:
                ref_output, node_next = node_next[mismatch]
                node_output += ref_output
            if not node_output:
                node_output += ch
            elif ch in node_next:
                ref_output, node_next = node_next[ch]
                node_output += ref_output
                break
            elif node_next is root:
                node_output += ch
                break
        node[mismatch] = (node_output, node_next)

    return root

def replace(input_text: str, root: ConvertTree) -> str:
    current = root
    output_arr = []
    for ch in input_text:
        while True:
            try:
                output, current = current[ch]
                output_arr.append(output)
            except KeyError:  # no transition for this character
                if current is root:
                    output_arr.append(ch)
                else:
                    output, current = current[mismatch]
                    output_arr.append(output)
                    continue
            break
    while current is not root:
        output, current = current['']
        output_arr.append(output)
    return ''.join(output_arr)

my_subs = [('Hello world', 'apple'), ('is', 'ship')]
string = 'Hello world! This is nice.'
new_string = replace(string, compile(my_subs))
print(new_string) # "apple! Thship ship nice."

Please leave a comment if you find any bugs in my code; I will do my best to fix them.


I use the same algorithm in my novel reader, which uses string replacement to convert between Simplified and Traditional Chinese. The related Python code can also be found in my GitHub repo (in Chinese).

answered Mar 17 '23 by tsh