Longest common substring between two long lists

Question

I have two really long lists, and I want to find the longest common sub string for each element of the first list in the second list.

A simplified example is

L1= ["a_b_c","d_e_f"]
L2=["xx""xy_a","xy_b_c","z_d_e","zl_d","z_d_e_y"]

So I want to find the best match for "a_b_c" in L2 ("xy_b_c"), then the best match for "d_e_f" in L2("z_d_e_y"). Best match for me is the string with the longest common characters. In I looked at examples for Levenshtein Distance which works for small lists just fine (http://www.stavros.io/posts/finding-the-levenshtein-distance-in-python/), but my list L2 has 163531 elements and It hasn't been able to find even one match for the last 15 minutes..

I do not have a CS background, can someone point me to some better algorithm (or even better, its implementation? :) ) Thanks a ton.

Current code (copied off the link and someone else from stackoverflow):

L1= ["a_b_c","d_e_f"]
L2=["xx""xy_a","xy_b_c","z_d_e","zl_d","z_d_e_y"]

def levenshtein_distance(first, second):
    """Find the Levenshtein distance between two strings."""
    if len(first) > len(second):
        first, second = second, first
    if len(second) == 0:
        return len(first)
    first_length = len(first) + 1
    second_length = len(second) + 1
    distance_matrix = [[0] * second_length for x in range(first_length)]
    for i in range(first_length):
       distance_matrix[i][0] = i
    for j in range(second_length):
       distance_matrix[0][j]=j
    for i in xrange(1, first_length):
        for j in range(1, second_length):
            deletion = distance_matrix[i-1][j] + 1
            insertion = distance_matrix[i][j-1] + 1
            substitution = distance_matrix[i-1][j-1]
            if first[i-1] != second[j-1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    return distance_matrix[first_length-1][second_length-1]

for string in L1:
    print sorted(L2,key = lambda x:levenshtein_distance(x,string))[0]

edit- just hit control+C and it gave me an incorrect(but close) answer after 15 minutes. Thats only for the first string and there's a lot of them left..

Ashwini Chaudhary · Accepted Answer

Use the difflib module:

>>> from functools import partial
>>> from difflib import SequenceMatcher
def func(x, y):
    s = SequenceMatcher(None, x, y)
    return s.find_longest_match(0, len(x), 0, len(y)).size
... 
for item in L1:
    f = partial(func, item)
    print max(L2, key=f)
...     
xy_b_c
z_d_e_y

root · Answer

You can also take a look at The Levenshtein Python C extension module. If I tested it on a random string in your example, it appeared to be about 150 times faster than the python implementation. And use max as shown by Ashwini Chaudhary.

Longest common substring between two long lists

Tags:

python

Illusionist

2 Answers

Ashwini Chaudhary

root

Recent Activity

Donate For Us

Longest common substring between two long lists

Tags:

python

Illusionist

2 Answers

Ashwini Chaudhary

root

Related questions

Recent Activity

Donate For Us