Given an input string, I would like to find the set of all other strings that are within Levenshtein distance 2. For example, if the input string is "aaa" and the alphabet is ['a', 'b'] then we have:
{'baaa', 'aab', 'a', 'aaaaa', 'baab', 'abbaa', 'abaa', 'aaabb', 'abb', 'aaab', 'ababa', 'aa', 'aabb', 'baba', 'baaab', 'aabab', 'aaaab', 'abaaa', 'aabaa', 'bbaaa', 'abaab', 'aaaa', 'baaaa', 'bab', 'bba', 'aba', 'aaaba', 'ba', 'aabba', 'abab', 'baa', 'aaa', 'bbaa', 'baaba', 'aaba', 'abba', 'ab', 'babaa'}
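For reference, membership in this set can be checked with a textbook dynamic-programming Levenshtein distance (a sketch; the function name edit_distance is mine, not from the code below):

```python
def edit_distance(a: str, b: str) -> int:
    # prev[j] holds the distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

assert edit_distance("aaa", "abbaa") <= 2   # in the set above
assert edit_distance("aaa", "bbb") > 2      # not in the set
```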
I have code to do this, but it is inefficient. Here it is, using all printable ASCII characters as the alphabet and the input string "aaaaaaaaaa":
import string

input_string = "a" * 10

f = (
    lambda input_string, dist=2, i=0: dist * input_string[i - 1 :]
    and {
        k[:i] + char + k[i + w :]
        for k in f(input_string, dist - 1)
        for char in [""] + list(string.printable)
        for w in (0, 1)
    }
    | f(input_string, dist, i + 1)
    or {input_string}
)

f(input_string)
I need to run this many times with different input strings. The code currently takes 2.9 s on my desktop to produce the 1,631,129 distinct strings. Can anyone see how to make it much faster?
League table so far (using printable as the alphabet):
My code: 2.98 s ± 146 ms
Alain T.'s code: 1.58 s ± 60.7 ms. Current winner.
ddg's code: 1.85 s ± 28.4 ms
Using iterators (to avoid loading everything into memory) I can get down to 0.17 seconds. With more context about your use case (why do you need all of them? for a spellchecker?) there are likely alternate approaches that avoid having to enumerate all possibilities.
from string import ascii_lowercase
from itertools import product
from typing import Iterator

def validate(start: str):
    assert set(start) <= set(ascii_lowercase)

def iter_deletion(string) -> Iterator[str]:
    for i in range(len(string)):
        yield string[:i] + string[i+1:]

def iter_insertion(string) -> Iterator[str]:
    for i, c in product(range(len(string) + 1), ascii_lowercase):
        yield string[:i] + c + string[i:]

def iter_replacement(string) -> Iterator[str]:
    for i, c in product(range(len(string)), ascii_lowercase):
        yield string[:i] + c + string[i+1:]

def iter_steps(string):
    yield from iter_deletion(string)
    yield from iter_insertion(string)
    yield from iter_replacement(string)

def view_all(string):
    validate(string)
    for d1 in iter_steps(string):
        yield from iter_steps(d1)

from timeit import timeit
timeit(lambda: set(view_all('aaaaaaaaaa')), number=1)
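Note that the generators yield duplicates, since different edit paths often collide on the same string; the caller deduplicates with set(). A self-contained sketch (the three step generators condensed into one function) that demonstrates this on a short input:

```python
from itertools import product
from string import ascii_lowercase

def iter_steps(s):
    for i in range(len(s)):                                  # deletions
        yield s[:i] + s[i+1:]
    for i, c in product(range(len(s) + 1), ascii_lowercase): # insertions
        yield s[:i] + c + s[i:]
    for i, c in product(range(len(s)), ascii_lowercase):     # replacements
        yield s[:i] + c + s[i+1:]

def view_all(s):
    for d1 in iter_steps(s):
        yield from iter_steps(d1)

raw = list(view_all('aaa'))
unique = set(raw)
# many two-step edit paths produce the same string, so set() shrinks the output
assert len(unique) < len(raw)
assert {'aaa', 'a', 'aaaaa', 'abbaa'} <= unique
```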
I don't think you'll be able to get significant speed improvements while still generating the whole set of possible edits.
You could shave 35% to 50% off the execution time by avoiding the re-expansion of strings that produce the same result after the first edit. The improvement would likely be larger at higher edit distances, and it also depends on the actual strings/words (which may not consist of a single repeated letter).
In any case, here's an optimized version of the function:
import string
from collections import deque

def makeEdits(S, count=1, charSet=None):
    if charSet is None:
        charSet = string.printable
    result = set([S])
    expand = deque(result)
    lastEdit = False

    def addEdit(s):
        if s in result:
            return
        result.add(s)
        if not lastEdit:
            expand.append(s)

    for edit in range(count):
        editing = expand.copy()
        expand.clear()
        lastEdit = edit == count - 1
        for eString in editing:
            for i, char in enumerate(eString):
                left, right = eString[:i], eString[i+1:]
                addEdit(left + right)                       # deletion
                for newChar in charSet:
                    addEdit(left + newChar + char + right)  # insertion before
                    addEdit(left + newChar + right)         # replacement
            for newChar in charSet:
                addEdit(eString + newChar)                  # insertion at end
    return result
and performance testing on my laptop:
from timeit import timeit

count = 1
input_string = "a" * 10

print('makeEdits()...')
print(len(makeEdits(input_string, 2)))
t = timeit(lambda: makeEdits(input_string, 2), number=count)
print(t)

print('f()...')
print(len(f(input_string)))
t = timeit(lambda: f(input_string), number=count)
print(t)
...
makeEdits()...
1631129
2.0302121050000004
f()...
1631129
3.145120027999999
Note that a smaller character set, such as string.ascii_letters, will improve performance considerably by producing fewer strings:
timeit(lambda:makeEdits(input_string,2,string.ascii_letters),number=count)
# 0.48669491199999015
len(makeEdits(input_string,2,string.ascii_letters)) # 433913
timeit(lambda:makeEdits(input_string,2,string.ascii_lowercase),number=count)
# 0.10699477299999671
len(makeEdits(input_string,2,string.ascii_lowercase)) # 104805
If you're making a spell checker, it would be reasonable to convert everything to lowercase before processing and to treat all special characters as a single character that you include in your character set. This would allow your program to get the performance boost of a smaller character set (30 times faster in this case). You would only need to add some preprocessing of the strings and perhaps some adjustments of results afterward.
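A minimal sketch of that preprocessing (the names normalize, PLACEHOLDER, and ALPHABET are mine): lowercase the word and collapse every non-letter into a single placeholder character that is also part of the edit alphabet.

```python
import string

PLACEHOLDER = '#'
ALPHABET = string.ascii_lowercase + PLACEHOLDER

def normalize(word: str) -> str:
    # lowercase, then map every non-letter to the placeholder
    return ''.join(c if c in string.ascii_lowercase else PLACEHOLDER
                   for c in word.lower())

assert normalize("Can't") == "can#t"
assert all(c in ALPHABET for c in normalize("Hello, World!"))
```

You would then call makeEdits(normalize(word), 2, ALPHABET) and, if needed, post-process the results to map the placeholder back to concrete special characters.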