Longest common substring in two sequences of strings

Having just learned the longest common substring algorithm, I was curious about a particular variant of the problem. It is described as follows:

Given two non-empty sequences of strings, X = (x1, x2, x3, ..., x(n)) and Y = (y1, y2, y3, ..., y(m)), where x(i) and y(j) are strings of characters, find the longest string in X which is a substring of all the strings of Y.

I have a function substring(x, y) which returns a boolean indicating whether x is a substring of y. Evidently, I have to concatenate all the strings in Y to form one big string, say, denoted by B. I thought of the following approaches:

Naive: Start by concatenating all strings in X to form a string A(n). Apply substring(A(n), B); this involves iterating backward through A(n). If it returns true, the algorithm ends here and returns A(n), or whatever portion of it is included in said substring. If not, proceed to apply substring(A(n - 1), B), and so on. If no such string exists in X, I return the empty string.

Obviously this approach could take considerable running time, depending on the implementation. Assuming I use an iterative approach, on each iteration I would have to iterate backward through the string at that level/index and then apply substring(). It would take at least two loops, and O(size(B) * maxlength(x1, x2,...)) worst-case time, or more depending on substring() (correct me if I'm wrong).

I thought of a second approach based on suffix trees/arrays.

Generalized Suffix Tree: I build a GST of sequence Y using Ukkonen's algorithm in O(maxlength(y1, y2,...)) (?). My lack of knowledge of suffix trees bites here. I believe a suffix tree approach would substantially reduce running time (at the cost of space) for finding the substring, but I have no idea how to implement the operation.
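To make the "index Y once, then query each x quickly" idea concrete, here is a rough sketch. It does not build a generalized suffix tree or use Ukkonen's algorithm; as a stand-in it builds one suffix automaton per string in Y, a related index that is built in time roughly linear in the string's length and answers "is x a substring?" in O(len(x)). The names `SuffixAutomaton` and `longest_indexed` are made up for this illustration.

```python
class SuffixAutomaton:
    """Substring index for one text (a stand-in for a suffix tree)."""

    def __init__(self, text):
        self.next = [{}]      # outgoing transitions of each state
        self.link = [-1]      # suffix link of each state
        self.length = [0]     # length of the longest string in each state
        last = 0
        for ch in text:
            cur = self._new_state(self.length[last] + 1)
            p = last
            while p != -1 and ch not in self.next[p]:
                self.next[p][ch] = cur
                p = self.link[p]
            if p == -1:
                self.link[cur] = 0
            else:
                q = self.next[p][ch]
                if self.length[p] + 1 == self.length[q]:
                    self.link[cur] = q
                else:
                    # Split q: the clone takes over the shorter strings.
                    clone = self._new_state(self.length[p] + 1)
                    self.next[clone] = dict(self.next[q])
                    self.link[clone] = self.link[q]
                    while p != -1 and self.next[p].get(ch) == q:
                        self.next[p][ch] = clone
                        p = self.link[p]
                    self.link[q] = clone
                    self.link[cur] = clone
            last = cur

    def _new_state(self, length):
        self.next.append({})
        self.link.append(-1)
        self.length.append(length)
        return len(self.length) - 1

    def contains(self, pattern):
        """True if pattern is a substring of the indexed text."""
        state = 0
        for ch in pattern:
            state = self.next[state].get(ch)
            if state is None:
                return False
        return True


def longest_indexed(xs, ys):
    """Longest string in xs that is a substring of every string in ys, or ""."""
    automata = [SuffixAutomaton(y) for y in ys]   # index Y once
    best = ""
    for x in xs:
        if len(x) > len(best) and all(a.contains(x) for a in automata):
            best = x
    return best
```

With this layout the preprocessing cost depends on the total length of Y, and each candidate x costs about len(x) steps per string in Y, independent of how long that string is.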

If there is a better approach, I'd love to know.

EDIT: Apologies if it seemed like I abandoned the topic.

What if I were to use not a GST, but some standard data structure such as a stack, queue, set, heap, priority queue, etc.? The sequence X would have to be sorted, longest string first, naturally. If I store it in a string array, I will have to use a sorting algorithm such as mergesort/quicksort. The goal is to get the most efficient running time possible.

Can I not store X in a structure that automatically sorts its elements as it builds itself? What about a max-heap?
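For what it's worth, the max-heap idea could look roughly like the sketch below. It assumes Python's `heapq` module (a min-heap, so lengths are negated to pop the longest string first) and uses Python's `in` operator in place of substring(x, y); `longest_via_heap` is a hypothetical name.

```python
import heapq


def longest_via_heap(xs, ys):
    """Pop candidates longest-first and return the first one found in every y."""
    # heapq is a min-heap, so negated lengths make the longest string pop first.
    heap = [(-len(x), x) for x in xs]
    heapq.heapify(heap)                 # O(n) to build
    while heap:
        _, x = heapq.heappop(heap)      # O(log n) per pop
        if all(x in y for y in ys):
            return x
    return ""
```

Compared with fully sorting X up front, the heap only pays for as many pops as it takes to reach the first match.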

It would seem like the suffix tree is the best way to find substrings in this fashion. Is there any other data structure I could use?

asked Oct 03 '13 11:10

PritishC

1 Answer

First, sort the array X from the longest string to the shortest. This way, the first string in X that is a substring of every string in Y is the solution.
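A minimal sketch of this idea, assuming Python's `in` operator stands in for the substring(x, y) function from the question (the function name `longest_common_candidate` is made up for the example):

```python
def longest_common_candidate(xs, ys):
    """Return the longest string in xs that occurs in every string of ys, or ""."""
    # Longest first, so the first hit is the answer.
    for x in sorted(xs, key=len, reverse=True):
        # `x in y` plays the role of substring(x, y).
        if all(x in y for y in ys):
            return x
    return ""


print(longest_common_candidate(["ab", "abcd", "bcd"], ["xabcdx", "bcdab", "zbcd"]))  # prints "bcd"
```

Here "abcd" is tried first but is missing from "bcdab", so the next-longest candidate, "bcd", is returned.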

A multiprocessor algorithm would be the best way to quickly test each string in X against all the strings in Y.
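One possible shape for the parallel version is sketched below, assuming Python's standard `concurrent.futures` module. The names `occurs_in_all` and `longest_parallel` and the `workers` parameter are illustrative. Note that in CPython, threads mostly show the structure here; for real speedups on CPU-bound matching, a `ProcessPoolExecutor` would be the more usual choice.

```python
from concurrent.futures import ThreadPoolExecutor


def occurs_in_all(x, ys):
    """True if x is a substring of every string in ys."""
    return all(x in y for y in ys)


def longest_parallel(xs, ys, workers=4):
    """Test candidates concurrently, then pick the longest one that passed."""
    candidates = sorted(xs, key=len, reverse=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with candidates.
        results = pool.map(occurs_in_all, candidates, [ys] * len(candidates))
        for x, ok in zip(candidates, results):
            if ok:
                return x
    return ""
```

The trade-off is that every candidate is submitted for testing, even those shorter than the eventual answer, in exchange for running the tests concurrently.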

answered Oct 02 '22 12:10

Vishkey

Tags: string, algorithm, data-structures
