I have two really long lists, and I want to find the longest common sub string for each element of the first list in the second list.
A simplified example is
L1= ["a_b_c","d_e_f"]
L2=["xx""xy_a","xy_b_c","z_d_e","zl_d","z_d_e_y"]
So I want to find the best match for "a_b_c" in L2 ("xy_b_c"), then the best match for "d_e_f" in L2("z_d_e_y"). Best match for me is the string with the longest common characters. In I looked at examples for Levenshtein Distance which works for small lists just fine (http://www.stavros.io/posts/finding-the-levenshtein-distance-in-python/), but my list L2 has 163531 elements and It hasn't been able to find even one match for the last 15 minutes..
I do not have a CS background, can someone point me to some better algorithm (or even better, its implementation? :) ) Thanks a ton.
Current code (copied off the link and someone else from stackoverflow):
L1= ["a_b_c","d_e_f"]
L2=["xx""xy_a","xy_b_c","z_d_e","zl_d","z_d_e_y"]
def levenshtein_distance(first, second):
"""Find the Levenshtein distance between two strings."""
if len(first) > len(second):
first, second = second, first
if len(second) == 0:
return len(first)
first_length = len(first) + 1
second_length = len(second) + 1
distance_matrix = [[0] * second_length for x in range(first_length)]
for i in range(first_length):
distance_matrix[i][0] = i
for j in range(second_length):
distance_matrix[0][j]=j
for i in xrange(1, first_length):
for j in range(1, second_length):
deletion = distance_matrix[i-1][j] + 1
insertion = distance_matrix[i][j-1] + 1
substitution = distance_matrix[i-1][j-1]
if first[i-1] != second[j-1]:
substitution += 1
distance_matrix[i][j] = min(insertion, deletion, substitution)
return distance_matrix[first_length-1][second_length-1]
for string in L1:
print sorted(L2,key = lambda x:levenshtein_distance(x,string))[0]
edit- just hit control+C and it gave me an incorrect(but close) answer after 15 minutes. Thats only for the first string and there's a lot of them left..
Use the difflib
module:
>>> from functools import partial
>>> from difflib import SequenceMatcher
def func(x, y):
s = SequenceMatcher(None, x, y)
return s.find_longest_match(0, len(x), 0, len(y)).size
...
for item in L1:
f = partial(func, item)
print max(L2, key=f)
...
xy_b_c
z_d_e_y
You can also take a look at The Levenshtein Python C extension module. If I tested it on a random string in your example, it appeared to be about 150 times faster than the python implementation. And use max
as shown by Ashwini Chaudhary.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With