People have addressed before how to make multiple substitutions in a string based on a dictionary (see, for example). There seems to be a group of options based on string.replace
and a group of options based on regular expressions, with a couple more options.
But I am interested in the efficiency of the different methods depending on the size of the dictionary, which I found to have a very important impact.
my_subs = {'Hello world': 'apple', 'is': 'ship'}
string = 'Hello world! This is nice.'
new_string = my_efficient_method(string, my_subs)
To be clear, this question is not about how to make these substitutions, but about which method is more efficient in which conditions, and which caveats apply. In particular, I am looking for the most practical approach when there are many (>100k) strings that are long (10-20k characters) and the dictionary is huge (>80k pairs). Under these circumstances the preferred methods based on regular expressions perform very poorly.
As stated before, there are different approaches, each with different advantages. I am using three different situations for comparison.
For dictionaries 1 and 2 (shorter ones) I repeat each method 50 times in a loop, to get a more consistent timing. With the longer one a single pass for one document takes long enough (sadly). I tested 1 and 2 using the online service tio with Python 3.8. The long one was tested in my laptop with Python 3.6. Only relative performance between methods is relevant, so the minor specifics are not important.
My string is between 28k and 29k characters.
All times given in seconds.
A colleague found Flashtext, a Python library that specializes precisely in this. It allows searching by query and also applying substitutions. It is about two orders of magnitude faster than other alternatives. In the experiment 3 my current best time was 1.8 seconds. Flashtext takes 0.015 seconds.
There are many variations, but the best tend to be very similar to this:
import re
rep = dict((re.escape(k), v) for k, v in my_dict.items())
pattern = re.compile("|".join(rep.keys()))
new_string = pattern.sub(lambda m: rep[re.escape(m.group(0))], string)
Execution times were:
This method simply applies string.replace
in a loop. (Later I talk about problems with this.)
for original, replacement in self.my_dict.items():
string = string.replace(original, replacement)
This solution proposes a variation using reduce
, that applies a Lambda expression iteratively. This is best understood with an example from the official documentation. The expression
reduce(lambda x, y: x+y, [1, 2, 3, 4, 5])
equals ((((1+2)+3)+4)+5)
import functools
new_string = functools.reduce(lambda a, k: a.replace(*k),
my_dict.items(), string)
Python 3.8 allows assignment expressions, as in this method. In its core this also relies on string.replace
.
[string := string.replace(f' {a} ', f' {b} ') for a, b in my_dict.items()]
Execution times were (in parenthesis results for reduce and assignment expressions variants):
This proposal involves using a recursive Lambda.
mrep = lambda s, d: s if not d else mrep(s.replace(*d.popitem()), d)
new_string = mrep(string, my_dict)
Execution times were:
RecursionError
RecursionError
See the update above: Flashtext is much faster than the other alternatives.
You can see from the execution times that the recursive approach is clearly the fastest, but it only works with small dictionaries. It is not recommended to increase the recursion depth much in Python, so this approach is entirely discarded for longer dictionaries.
Regular expressions offer more control over your substitutions. For example, you may use \b
before or after an element to ensure that there are no word characters at that side of the target substring (to prevent {'a': '1'} to be applied to 'apple'). The cost is that performance drops sharply for longer dictionaries, taking almost four times as long as other options.
Assignment expressions, reduce and simply looping replace offer similar performance (assignment expressions could not be tested with the longer dictionary). Taking readability into account, string.replace
seems like the best option. The problem with this, compared to regular expressions, is that substitutions happen sequentially, not in a single pass. So {'a': 'b', 'b': 'c'} returns 'c' for string 'a'. Dictionaries are now ordered in Python (but you may want to keep using OrderedDict) so you can set the order of substitutions carefully to avoid problems. Of course, with 80k substitutions you cannot rely on this.
I am currently using a loop with replace, and doing some preprocessing to minimize trouble. I am adding spaces at both sides of punctuation (also in the dictionary for items containing punctuation). Then I can search for substrings surrounded by spaces, and insert substitutions with spaces as well. This also works when your targets are multiple words:
string = 'This is: an island'
my_dict = {'is': 'is not', 'an island': 'a museum'}
Using replace and regular expressions I get string = ' This is : an island '
so that my replace loop
for original, replacement in self.my_dict.items():
string = string.replace(f' {original} ', f' {replacement} ')
returns ' This is not : a museum '
as intended. Note that 'is' in 'This' and 'island' were left alone. Regular expressions could be used to fix punctuation back, although I don't require this step.
By compile your replacement table into a Finite State Machine. You could finish the replacement in a single phase char by char scan. And using pre-compiled FSM would boost your performance if your would apply same replacement table on many different strings.
Comparing to other methods:
replace
method: Searching substrings using KMP algorithm is actually a FSM. We just build multiple FSM's into a single one.The time complexity of compile
is O(mk2) where m is entries of replacement table and k is the length of a single replacement rule.
The time complexity of replace
is O(n+n') where n is the length of input, n' is the length of output.
Just to be clarify: In codes below, when multiple rules matched on same characters, the one begins earliest applied. In case a tie, the one longest applied.
import typing
mismatch = ''
ConvertTree = typing.Dict[str, typing.Tuple[str, 'ConvertTree']]
def compile(rulePairs:typing.Iterable[typing.Tuple[str, str]]) -> ConvertTree:
# Build Trie
root: ConvertTree = {}
for src, dst in rulePairs:
current = root
for ch in src:
if ch not in current:
node: ConvertTree = {}
node[mismatch] = None
current[ch] = ('', node)
current = current[ch][1]
if not current.get(mismatch, None):
current[mismatch] = (dst, root)
else:
old_dst = current[mismatch][0]
if dst != old_dst:
raise KeyError(f"Found conflict rules:\n * {src} -> {old_dst}\n * {src} -> {dst}")
# Fill non-leaf nodes
node_with_suffix: typing.List[typing.Tuple[ConvertTree, str, str]] = [(root, '', '')]
for node, output, suffix in node_with_suffix:
node_output, node_suffix = output, suffix;
if node.get(mismatch, None):
leaf = node[mismatch]
node_output, node_suffix = leaf[0], ''
for key in node:
if key == mismatch: continue
val = node[key]
next_node = val[1]
if next_node is root: continue
if len(next_node) == 1:
node[key] = next_node[mismatch]
else:
node_with_suffix.append((next_node, node_output, node_suffix + key))
if not node_suffix: continue
node_next = root
for ch in node_suffix:
while node_output != '' and ch not in node_next and node_next is not root:
ref_output, node_next = node_next[mismatch]
node_output += ref_output
if not node_output:
node_output += ch
elif ch in node_next:
ref_output, node_next = node_next[ch]
node_output += ref_output
break
elif node_next is root:
node_output += ch
break
node[mismatch] = (node_output, node_next)
return root
def replace(input_text: str, root: ConvertTree) -> str:
current = root
output_arr = []
for ch in input_text:
while True:
try:
output, current = current[ch]
output_arr.append(output)
except:
if current is root:
output_arr.append(ch)
else:
output, current = current[mismatch]
output_arr.append(output)
continue
break
while current is not root:
output, current = current['']
output_arr.append(output)
return ''.join(output_arr)
my_subs = [('Hello world', 'apple'), ('is', 'ship')]
string = 'Hello world! This is nice.'
new_string = replace(string, compile(my_subs))
print(new_string) # "apple! Thship ship nice."
Please leave a comment if you had find out any bugs of by code. I will try my best to fix them if there are any.
I used the same algorithm on my novel reader which use string replacements to convert between Chinese Simplify and Traditional variance. The related Python code may also be found on my GitHub repo (Chinese).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With