Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Longest common substring between two long lists

Tags:

python

I have two really long lists, and I want to find the longest common sub string for each element of the first list in the second list.

A simplified example is

L1= ["a_b_c","d_e_f"]
L2=["xx""xy_a","xy_b_c","z_d_e","zl_d","z_d_e_y"]

So I want to find the best match for "a_b_c" in L2 ("xy_b_c"), then the best match for "d_e_f" in L2("z_d_e_y"). Best match for me is the string with the longest common characters. In I looked at examples for Levenshtein Distance which works for small lists just fine (http://www.stavros.io/posts/finding-the-levenshtein-distance-in-python/), but my list L2 has 163531 elements and It hasn't been able to find even one match for the last 15 minutes..

I do not have a CS background, can someone point me to some better algorithm (or even better, its implementation? :) ) Thanks a ton.

Current code (copied off the link and someone else from stackoverflow):

L1= ["a_b_c","d_e_f"]
L2=["xx""xy_a","xy_b_c","z_d_e","zl_d","z_d_e_y"]

def levenshtein_distance(first, second):
    """Find the Levenshtein distance between two strings."""
    if len(first) > len(second):
        first, second = second, first
    if len(second) == 0:
        return len(first)
    first_length = len(first) + 1
    second_length = len(second) + 1
    distance_matrix = [[0] * second_length for x in range(first_length)]
    for i in range(first_length):
       distance_matrix[i][0] = i
    for j in range(second_length):
       distance_matrix[0][j]=j
    for i in xrange(1, first_length):
        for j in range(1, second_length):
            deletion = distance_matrix[i-1][j] + 1
            insertion = distance_matrix[i][j-1] + 1
            substitution = distance_matrix[i-1][j-1]
            if first[i-1] != second[j-1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    return distance_matrix[first_length-1][second_length-1]

for string in L1:
    print sorted(L2,key = lambda x:levenshtein_distance(x,string))[0]

edit- just hit control+C and it gave me an incorrect(but close) answer after 15 minutes. Thats only for the first string and there's a lot of them left..

like image 717
Illusionist Avatar asked Dec 25 '22 16:12

Illusionist


2 Answers

Use the difflib module:

>>> from functools import partial
>>> from difflib import SequenceMatcher
def func(x, y):
    s = SequenceMatcher(None, x, y)
    return s.find_longest_match(0, len(x), 0, len(y)).size
... 
for item in L1:
    f = partial(func, item)
    print max(L2, key=f)
...     
xy_b_c
z_d_e_y
like image 148
Ashwini Chaudhary Avatar answered Jan 12 '23 01:01

Ashwini Chaudhary


You can also take a look at The Levenshtein Python C extension module. If I tested it on a random string in your example, it appeared to be about 150 times faster than the python implementation. And use max as shown by Ashwini Chaudhary.

like image 30
root Avatar answered Jan 12 '23 01:01

root