I am trying to find the longest common subsequence of 3 or more strings. The Wikipedia article has a great description of how to do this for 2 strings, but I'm a little unsure of how to extend this to 3 or more strings.
There are plenty of libraries for finding the LCS of 2 strings, so I'd like to use one of them if possible. If I have 3 strings A, B and C, is it valid to find the LCS of A and B as X, and then find the LCS of X and C, or is this the wrong way to do it?
I've implemented it in Python as follows:
import difflib def lcs(str1, str2): sm = difflib.SequenceMatcher() sm.set_seqs(str1, str2) matching_blocks = [str1[m.a:m.a+m.size] for m in sm.get_matching_blocks()] return "".join(matching_blocks) print reduce(lcs, ['abacbdab', 'bdcaba', 'cbacaa'])
This outputs "ba", however it should be "baa".
Just generalize the recurrence relation.
For three strings:
dp[i, j, k] = 1 + dp[i - 1, j - 1, k - 1] if A[i] = B[j] = C[k] max(dp[i - 1, j, k], dp[i, j - 1, k], dp[i, j, k - 1]) otherwise
Should be easy to generalize to more strings from this.
I just had to do this for a homework, so here is my dynamic programming solution in python that's pretty efficient. It is O(nml) where n, m and l are the lengths of the three sequences.
The solution works by creating a 3D array and then enumerating all three sequences to calculate the path of the longest subsequence. Then you can backtrack through the array to reconstruct the actual subsequence from its path.
So, you initialize the array to all zeros, and then enumerate the three sequences. At each step of the enumeration, you either add one to the length of the longest subsequence (if there's a match) or just carry forward the longest subsequence from the previous step of the enumeration.
Once the enumeration is complete, you can now trace back through the array to reconstruct the subsequence from the steps you took. i.e. as you travel backwards from the last entry in the array, each time you encounter a match you look it up in any of the sequences (using the coordinate from the array) and add it to the subsequence.
def lcs3(a, b, c): m = len(a) l = len(b) n = len(c) subs = [[[0 for k in range(n+1)] for j in range(l+1)] for i in range(m+1)] for i, x in enumerate(a): for j, y in enumerate(b): for k, z in enumerate(c): if x == y and y == z: subs[i+1][j+1][k+1] = subs[i][j][k] + 1 else: subs[i+1][j+1][k+1] = max(subs[i+1][j+1][k], subs[i][j+1][k+1], subs[i+1][j][k+1]) # return subs[-1][-1][-1] #if you only need the length of the lcs lcs = "" while m > 0 and l > 0 and n > 0: step = subs[m][l][n] if step == subs[m-1][l][n]: m -= 1 elif step == subs[m][l-1][n]: l -= 1 elif step == subs[m][l][n-1]: n -= 1 else: lcs += str(a[m-1]) m -= 1 l -= 1 n -= 1 return lcs[::-1]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With