Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Longest common substring in two sequences of strings

Having just learned the longest common substring algorithm, I was curious about a particular variant of the problem. It is described as follows -:

Given two non-empty sequences of strings, X = (x1, x2, x3,....,x(n)) and Y = (y1, y2, y3,..., y(m)), where x(i) and y(i) are strings of characters, find the longest string in X which is a substring of all the strings of Y.

I have a function substring(x, y) which returns booleans depicting whether x is a substring in y or not. Evidently, I have to concatenate all the strings in Y to form one big string, say, denoted by B. I thought of the following approaches -:

  • Naive: Start by concatenating all strings in X to form a string A(n). Apply substring(A(n), B) - this includes iterating backward in string A(n). If true, the algorithm ends here and returns A(n) - or whatever portion of it is included in said substring. If not, proceed to apply (A(n - 1), B) and so on. If such a string does not exist in X, I return the empty string.

Obviously this approach would take up quite some running time depending on the implementation. Assuming I use an iterative approach, on each iteration I would have to iterate backward through the String at that level/index, and subsequently apply substring(). It would take atleast two loops, and O(size(B) * maxlength(x1, x2,...)) worst case time, or more depending on substring() (correct me if wrong).

I thought of a second approach based on suffix trees/arrays.

  • Generalized Suffix Tree: I build a GST of sequence Y using Ukkonen's algorithm in O(maxlength(y1, y2,...)(?). My lack of knowledge of suffix trees bites. I believe a suffix tree approach would substantially reduce running time (at the cost of space) for finding the substring, but I have no idea how to implement the operation.

If there is a better approach, I'd love to know.

EDIT: Apologies if it seemed like I abandoned the topic.

What if I were to use not a GST, but some standard data structure such as a stack, queue, set, heap, priority queue, etc.? The sequence X would have to be sorted, largest string first, naturally. If I store it in a string array, I will have to use a sorting algorithm such as mergesort/quicksort. The goal is to get the most efficient running time as possible.

Can I not store X in a structure that automatically sorts its elements as it builds itself? What about a max-heap?

It would seem like the suffix tree is the best way to find substrings in this fashion. Is there any other data structure I could use?

like image 278
PritishC Avatar asked Oct 03 '13 11:10

PritishC


People also ask

Which is longest common substring in S1 Abcdaf S2 Azbcdf?

The longest common substring is “abcdez” and is of length 6. Try It!

How do you calculate LCS?

Using Dynamic Programming to find the LCSCreate a table of dimension n+1*m+1 where n and m are the lengths of X and Y respectively. The first row and the first column are filled with zeros. Fill each cell of the table using the following logic.


1 Answers

First, order the array X for the longest string to shorter. This way, the first string in X that be a substring of all Y strings is the solution.

A multiprocessor algorithm would be the best way to solve the problem of test each X string with all Y strings quickly.

like image 138
Vishkey Avatar answered Oct 02 '22 12:10

Vishkey