Having just learned the longest common substring algorithm, I was curious about a particular variant of the problem. It is described as follows -:
Given two non-empty sequences of strings, X = (x1, x2, x3,....,x(n)) and Y = (y1, y2, y3,..., y(m)), where x(i) and y(i) are strings of characters, find the longest string in X which is a substring of all the strings of Y.
I have a function substring(x, y)
which returns booleans depicting whether x is a substring in y or not. Evidently, I have to concatenate all the strings in Y to form one big string, say, denoted by B. I thought of the following approaches -:
Obviously this approach would take up quite some running time depending on the implementation. Assuming I use an iterative approach, on each iteration I would have to iterate backward through the String at that level/index, and subsequently apply substring(). It would take atleast two loops, and O(size(B) * maxlength(x1, x2,...))
worst case time, or more depending on substring() (correct me if wrong).
I thought of a second approach based on suffix trees/arrays.
O(maxlength(y1, y2,...)
(?). My lack of knowledge of suffix trees bites. I believe a suffix tree approach would substantially reduce running time (at the cost of space) for finding the substring, but I have no idea how to implement the operation.If there is a better approach, I'd love to know.
EDIT: Apologies if it seemed like I abandoned the topic.
What if I were to use not a GST, but some standard data structure such as a stack, queue, set, heap, priority queue, etc.? The sequence X would have to be sorted, largest string first, naturally. If I store it in a string array, I will have to use a sorting algorithm such as mergesort/quicksort. The goal is to get the most efficient running time as possible.
Can I not store X in a structure that automatically sorts its elements as it builds itself? What about a max-heap?
It would seem like the suffix tree is the best way to find substrings in this fashion. Is there any other data structure I could use?
The longest common substring is “abcdez” and is of length 6. Try It!
Using Dynamic Programming to find the LCSCreate a table of dimension n+1*m+1 where n and m are the lengths of X and Y respectively. The first row and the first column are filled with zeros. Fill each cell of the table using the following logic.
First, order the array X for the longest string to shorter. This way, the first string in X that be a substring of all Y strings is the solution.
A multiprocessor algorithm would be the best way to solve the problem of test each X string with all Y strings quickly.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With