Algorithm to find similar strings in a list of many strings

Question

I know about approximate string searching and things like the Levenshtein distance, but what I want to do is take a large list of strings and quickly pick out any matching pairs that are similar to each other (say, 1 Damerau-Levenshtein distance apart). So something like this

l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]

matching_strings(l)

# Output
# [["moose","mouse"],["rat", "cat"]]

I only really know how to use R and Python, so bonus points if your solution can be easily implemented in one of those languages.

UPDATE:

Thanks to Collapsar's help, here is a solution in Python

import numpy
import functools
alphabet = {'a': 0, 'c': 2, 'b': 1, 'e': 4, 'd': 3, 'g': 6, 'f': 5, 'i': 8, 'h': 7, 'k': 10, 'j': 9, 'm': 12, 'l': 11, 'o': 14, 'n': 13, 'q': 16, 'p': 15, 's': 18, 'r': 17, 'u': 20, 't': 19, 'w': 22, 'v': 21, 'y': 24, 'x': 23, 'z': 25}


l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]
fvlist=[]

for string in l:
    fv = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
    for letter in string:
        fv[alphabet[letter]]+=1
    fvlist.append(fv)

fvlist.sort (key=functools.cmp_to_key(lambda fv1,fv2: numpy.sign(numpy.sum(numpy.subtract(fv1, fv2)))))

However, the sorted vectors are returned in the following order:

"rat" "cat" "lion" "fish" "moose" "tiger" "mouse"

Which I would consider to be sub-optimal because I would want moose and mouse to end up next to each other. I understand that however I sort these words there's no way to get all of the words next to all of their closest pairs. However, I am still open to alternative solutions

Ishamael · Accepted Answer

One way to do that (with complexity O(n k^2), where n is number of strings and k is the longest string) is to convert every string into a set of masks like this:

rat => ?at, r?t, ra?, ?rat, r?at, ra?t, rat?

This way if two words are different in one letter, like 'rat' and 'cat', they will both have a mask ?at among others, while if one word is a subsequence of another, like 'rat' and 'rats', they will both have mask 'rat?'.

Then you just group strings based on their masks, and print groups that have more than two strings. You might want to dedup your array first, if it has duplicates.

Here's an example code, with an extra cats string in it.

l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat", "cats"]
d = {}

def add(mask, s):
    if mask not in d:
        d[mask] = []
    d[mask].append(s)

for s in l:
    for pos in range(len(s)):
        add(s[:pos] + '?' + s[pos + 1:], s)
        add(s[:pos] + '?' + s[pos:], s)
    add(s + '?', s)

for k, v in d.items():
    if len(v) > 1:
        print v

Outputs

['moose', 'mouse']
['rat', 'cat']
['cat', 'cats']

Algorithm to find similar strings in a list of many strings

Tags:

algorithm

C_Z_

1 Answers

Ishamael

Recent Activity

Donate For Us

Algorithm to find similar strings in a list of many strings

Tags:

algorithm

C_Z_

1 Answers

Ishamael

Related questions

Recent Activity

Donate For Us