Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can this code to find the neighborhood of a string be sped up?

Given an input string, I would like to find the set of all other strings that are within Levenshtein distance 2. For example, if the input string is "aaa" and the alphabet is ['a', 'b'] then we have:

{'baaa', 'aab', 'a', 'aaaaa', 'baab', 'abbaa', 'abaa', 'aaabb', 'abb', 'aaab', 'ababa', 'aa', 'aabb', 'baba', 'baaab', 'aabab', 'aaaab', 'abaaa', 'aabaa', 'bbaaa', 'abaab', 'aaaa', 'baaaa', 'bab', 'bba', 'aba', 'aaaba', 'ba', 'aabba', 'abab', 'baa', 'aaa', 'bbaa', 'baaba', 'aaba', 'abba', 'ab', 'babaa'}

I have code to do this but it is inefficient. Here it is using all printable ascii characters as the alphabet and the input string aaaaaaaaaa.

import string

input_string = "a" * 10
f = (
    lambda input_string, dist=2, i=0: dist * input_string[i - 1 :]
    and {
        k[:i] + char + k[i + w:]
        for k in f(input_string, dist - 1)
        for char in [""] + list(string.printable)
        for w in (0, 1)
    }
    | f(input_string, dist, i + 1)
    or {input_string}
)
f(input_string)

I need to run this many times with different input strings. This code takes 2.9s currently on my desktop to make the 1631129 different strings. Can anyone see how to make it much faster?


League table so far (using printable as the alphabet):

My code: 2.98 s ± 146 ms

Alain T.'s code: 1.58 s ± 60.7 ms. Current winner.

ddg's code: 1.85 s ± 28.4 ms

like image 321
graffe Avatar asked Dec 16 '20 20:12

graffe


Video Answer


2 Answers

Using iterators (to avoid loading everything in memory) I can get down to 0.17 seconds. With more context about your usage case (why do you need all of them? for a spellchecker?) there's likely alternate approaches that avoid having to enumerate all possibilities.

from string import ascii_lowercase
from itertools import product

def validate(start: str):
    assert set(start) <= set(ascii_lowercase)

def iter_deletion(string) -> str:
    for i in range(len(string)):
        yield string[:i] + string[i+1:]
        
def iter_insertion(string) -> str:
    for i,c in product(range(len(string)+1), ascii_lowercase):
        yield string[:i] + c + string[i:]

def iter_replacement(string) -> str:
    for i,c in product(range(len(string)), ascii_lowercase):
        yield string[:i] + c + string[i+1:]

def iter_steps(string):
    yield from iter_deletion(string)
    yield from iter_insertion(string)
    yield from iter_replacement(string)

def view_all(string):
    validate(string)
    for d1 in iter_steps(string):
        yield from iter_steps(d1)

from timeit import timeit
timeit(lambda: set(view_all('aaaaaaaaaa')), number=1)
like image 153
ddg Avatar answered Oct 22 '22 04:10

ddg


I don't think you'll be able to get significant speed improvements generating the whole set of possible edits.

You could shave 35% to 50% of execution time by avoiding the re-expansion of strings that produce the same result after the first edit. This would likely give a more significant difference with a higher editing distance. The level of improvement will also depend on the actual strings/words (that may not be composed of a single repeated letter).

In any case, here's an optimized version of the function:

import string
from collections import deque

def makeEdits(S,count=1,charSet=None):
    if charSet is None: charSet = string.printable
    result   = set([S])
    expand   = deque(result)
    lastEdit = False

    def addEdit(s):
        if s in result: return 
        result.add(s)
        if not lastEdit: expand.append(s)

    for edit in range(count):
        editing = expand.copy()
        expand.clear()
        lastEdit = edit == count-1
        for eString in editing:
            for i,char in enumerate(eString):
                left,right = eString[:i],eString[i+1:]                       
                addEdit(left+right)                  # deletion
                for newChar in charSet:                    
                    addEdit(left+newChar+char+right) # insertions before
                    addEdit(left+newChar+right)      # replacement                   
            for newChar in charSet:
                addEdit(eString+newChar)             # insertion at end
    return result

and performance testing on my laptop:

from timeit import timeit
count = 1
input_string = "a" * 10

print('makeEdits()...')
print(len(makeEdits(input_string,2)))
t = timeit(lambda:makeEdits(input_string,2),number=count)
print(t)

print('f()...')
print(len(f(input_string)))
t = timeit(lambda:f(input_string),number=count)
print(t)

...

makeEdits()...
1631129
2.0302121050000004
f()...
1631129
3.145120027999999

Note that a smaller character set, such as string.ascii_letters will improve the performance results considerably (by producing fewer strings):

timeit(lambda:makeEdits(input_string,2,string.ascii_letters),number=count)

# 0.48669491199999015

len(makeEdits(input_string,2,string.ascii_letters)) # 433913

timeit(lambda:makeEdits(input_string,2,string.ascii_lowercase),number=count)

# 0.10699477299999671

len(makeEdits(input_string,2,string.ascii_lowercase)) # 104805

If you're making a spell checker, it would be reasonable to convert everything to lowercase before processing and to treat all special characters as a single character that you include in your character set. This would allow your program to get the performance boost of a smaller character set (30 times faster in this case). You would only need to add some preprocessing of the strings and perhaps some adjustments of results afterward.

like image 35
Alain T. Avatar answered Oct 22 '22 05:10

Alain T.