I know about approximate string searching and things like the Levenshtein distance, but what I want to do is take a large list of strings and quickly pick out any matching pairs that are similar to each other (say, 1 Damerau-Levenshtein distance apart). So something like this
l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]
matching_strings(l)
# Output
# [["moose","mouse"],["rat", "cat"]]
I only really know how to use R and Python, so bonus points if your solution can be easily implemented in one of those languages.
UPDATE:
Thanks to Collapsar's help, here is a solution in Python
import numpy
import functools
alphabet = {'a': 0, 'c': 2, 'b': 1, 'e': 4, 'd': 3, 'g': 6, 'f': 5, 'i': 8, 'h': 7, 'k': 10, 'j': 9, 'm': 12, 'l': 11, 'o': 14, 'n': 13, 'q': 16, 'p': 15, 's': 18, 'r': 17, 'u': 20, 't': 19, 'w': 22, 'v': 21, 'y': 24, 'x': 23, 'z': 25}
l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]
fvlist=[]
for string in l:
fv = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
for letter in string:
fv[alphabet[letter]]+=1
fvlist.append(fv)
fvlist.sort (key=functools.cmp_to_key(lambda fv1,fv2: numpy.sign(numpy.sum(numpy.subtract(fv1, fv2)))))
However, the sorted vectors are returned in the following order:
"rat" "cat" "lion" "fish" "moose" "tiger" "mouse"
Which I would consider to be sub-optimal because I would want moose and mouse to end up next to each other. I understand that however I sort these words there's no way to get all of the words next to all of their closest pairs. However, I am still open to alternative solutions
One way to do that (with complexity O(n k^2)
, where n
is number of strings and k
is the longest string) is to convert every string into a set of masks like this:
rat => ?at, r?t, ra?, ?rat, r?at, ra?t, rat?
This way if two words are different in one letter, like 'rat' and 'cat', they will both have a mask ?at
among others, while if one word is a subsequence of another, like 'rat' and 'rats', they will both have mask 'rat?'.
Then you just group strings based on their masks, and print groups that have more than two strings. You might want to dedup your array first, if it has duplicates.
Here's an example code, with an extra cats
string in it.
l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat", "cats"]
d = {}
def add(mask, s):
if mask not in d:
d[mask] = []
d[mask].append(s)
for s in l:
for pos in range(len(s)):
add(s[:pos] + '?' + s[pos + 1:], s)
add(s[:pos] + '?' + s[pos:], s)
add(s + '?', s)
for k, v in d.items():
if len(v) > 1:
print v
Outputs
['moose', 'mouse']
['rat', 'cat']
['cat', 'cats']
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With