Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Longest common subsequence of 3+ strings

Tags:

I am trying to find the longest common subsequence of 3 or more strings. The Wikipedia article has a great description of how to do this for 2 strings, but I'm a little unsure of how to extend this to 3 or more strings.

There are plenty of libraries for finding the LCS of 2 strings, so I'd like to use one of them if possible. If I have 3 strings A, B and C, is it valid to find the LCS of A and B as X, and then find the LCS of X and C, or is this the wrong way to do it?

I've implemented it in Python as follows:

import difflib  def lcs(str1, str2):     sm = difflib.SequenceMatcher()     sm.set_seqs(str1, str2)     matching_blocks = [str1[m.a:m.a+m.size] for m in sm.get_matching_blocks()]     return "".join(matching_blocks)  print reduce(lcs, ['abacbdab', 'bdcaba', 'cbacaa']) 

This outputs "ba", however it should be "baa".

like image 728
del Avatar asked Feb 20 '11 13:02

del


2 Answers

Just generalize the recurrence relation.

For three strings:

dp[i, j, k] = 1 + dp[i - 1, j - 1, k - 1] if A[i] = B[j] = C[k]               max(dp[i - 1, j, k], dp[i, j - 1, k], dp[i, j, k - 1]) otherwise 

Should be easy to generalize to more strings from this.

like image 168
IVlad Avatar answered Sep 20 '22 16:09

IVlad


I just had to do this for a homework, so here is my dynamic programming solution in python that's pretty efficient. It is O(nml) where n, m and l are the lengths of the three sequences.

The solution works by creating a 3D array and then enumerating all three sequences to calculate the path of the longest subsequence. Then you can backtrack through the array to reconstruct the actual subsequence from its path.

So, you initialize the array to all zeros, and then enumerate the three sequences. At each step of the enumeration, you either add one to the length of the longest subsequence (if there's a match) or just carry forward the longest subsequence from the previous step of the enumeration.

Once the enumeration is complete, you can now trace back through the array to reconstruct the subsequence from the steps you took. i.e. as you travel backwards from the last entry in the array, each time you encounter a match you look it up in any of the sequences (using the coordinate from the array) and add it to the subsequence.

def lcs3(a, b, c):     m = len(a)     l = len(b)     n = len(c)     subs = [[[0 for k in range(n+1)] for j in range(l+1)] for i in range(m+1)]      for i, x in enumerate(a):         for j, y in enumerate(b):             for k, z in enumerate(c):                 if x == y and y == z:                     subs[i+1][j+1][k+1] = subs[i][j][k] + 1                 else:                     subs[i+1][j+1][k+1] = max(subs[i+1][j+1][k],                                                subs[i][j+1][k+1],                                                subs[i+1][j][k+1])     # return subs[-1][-1][-1] #if you only need the length of the lcs     lcs = ""     while m > 0 and l > 0 and n > 0:         step = subs[m][l][n]         if step == subs[m-1][l][n]:             m -= 1         elif step == subs[m][l-1][n]:             l -= 1         elif step == subs[m][l][n-1]:             n -= 1         else:             lcs += str(a[m-1])             m -= 1             l -= 1             n -= 1      return lcs[::-1] 
like image 33
BarthesSimpson Avatar answered Sep 21 '22 16:09

BarthesSimpson