How to efficiently find identical substrings of a specified length in a collection of strings?

Tags:

string

algorithm

I have a collection S, typically containing 10-50 long strings. For illustrative purposes, suppose the length of each string ranges between 1000 and 10000 characters.

I would like to find strings of specified length k (typically in the range of 5 to 20) that are substrings of every string in S. This can obviously be done using a naive approach - enumerating every k-length substring in S[0] and checking if they exist in every other element of S.

Are there more efficient ways of approaching the problem? As far as I can tell, there are some similarities between this and the longest common subsequence problem, but my understanding of LCS is limited and I'm not sure how it could be adapted to the situation where we bound the desired common substring length to k, or if subsequence techniques can be applied to finding substrings.


asked Sep 26 '18 03:09

Samantha

3 Answers

Here's one fairly simple algorithm, which should be reasonably fast.

1. Using a rolling hash as in the Rabin-Karp string search algorithm, construct a hash table H₀ of all the |S₀|-k+1 length-k substrings of S₀. That's roughly O(|S₀|) since each hash is computed in O(1) from the previous hash, but it will take longer if there are collisions or duplicate substrings. Using a better hash will help you with collisions, but if there are a lot of duplicate k-length substrings in S₀ then you could end up using O(k|S₀|).
2. Now use the same rolling hash on S₁. This time, look each substring up in H₀ and, if you find it, remove it from H₀ and insert it into a new table H₁. Again, this should be around O(|S₁|) unless you have some pathological case, like both S₀ and S₁ being just long repetitions of the same character. (It's also going to be suboptimal if S₀ and S₁ are the same string, or have lots of overlapping pieces.)
3. Repeat step 2 for each S_i, each time creating a new hash table. (At the end of each iteration of step 2, you can delete the hash table from the previous step.)

At the end, the last hash table will contain all the common k-length substrings.

The total run time should be about O(Σ|S_i|) but in the worst case it could be O(kΣ|S_i|). Even so, with the problem size as described, it should run in acceptable time.
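The steps above can be sketched in Python roughly as follows. This is only an illustration, not the answerer's code: the names `rolling_hashes` and `common_k_substrings`, the constants `BASE` and `MOD`, and the choice to bucket the actual substrings under each hash (to cope with collisions) are all assumptions made for the sketch.

```python
from collections import defaultdict

BASE = 257
MOD = (1 << 61) - 1  # a large Mersenne prime; an arbitrary choice for this sketch


def rolling_hashes(s, k):
    """Yield (start, hash) for every k-length window of s, with O(1) updates."""
    if len(s) < k:
        return
    high = pow(BASE, k - 1, MOD)  # weight of the character leaving the window
    h = 0
    for ch in s[:k]:
        h = (h * BASE + ord(ch)) % MOD
    yield 0, h
    for i in range(1, len(s) - k + 1):
        h = (h - ord(s[i - 1]) * high) % MOD      # drop the outgoing character
        h = (h * BASE + ord(s[i + k - 1])) % MOD  # append the incoming character
        yield i, h


def common_k_substrings(strings, k):
    """Return the set of k-length substrings present in every string."""
    if not strings:
        return set()
    # H0: hash -> distinct substrings of strings[0] with that hash.
    # Slicing every window here costs O(k) each, which is the O(k|S0|) case.
    table = defaultdict(set)
    for i, h in rolling_hashes(strings[0], k):
        table[h].add(strings[0][i:i + k])
    for s in strings[1:]:
        survivors = defaultdict(set)  # the new table H_i
        for i, h in rolling_hashes(s, k):
            bucket = table.get(h)
            if bucket:  # slice and compare only when the hash matches
                sub = s[i:i + k]
                if sub in bucket:
                    bucket.discard(sub)    # remove it from the old table
                    survivors[h].add(sub)  # insert it into the new table
        table = survivors  # the previous table can now be dropped
    return {sub for bucket in table.values() for sub in bucket}
```

For example, `common_k_substrings(["abcdefg", "xxcdefyy", "cdef--bcd"], 3)` gives `{"cde", "def"}`.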


answered Sep 17 '22 04:09

rici

Some thoughts (N is number of strings, M is average length, K is needed substring size):

Approach 1:

Walk through all strings, computing a rolling hash for each k-length substring and storing these hashes in a map (store the tuple {key: hash; string_num; position})

time O(NxM), space O(NxM)

Extract groups with equal hash, check step-by-step:
1) that size of group >= number of strings
2) all strings are represented in this group
3) thorough checking of real substrings for equality (sometimes hashes of distinct substrings might coincide)
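A minimal Python sketch of Approach 1, assuming a simple polynomial rolling hash. The names `window_hashes` and `common_by_hash_groups` and the constants `BASE` and `MOD` are hypothetical, chosen just for this illustration.

```python
from collections import defaultdict

BASE = 257
MOD = (1 << 61) - 1  # arbitrary large prime modulus for this sketch


def window_hashes(s, k):
    """Yield (position, hash) for each k-length window of s."""
    if len(s) < k:
        return
    high = pow(BASE, k - 1, MOD)
    h = 0
    for ch in s[:k]:
        h = (h * BASE + ord(ch)) % MOD
    yield 0, h
    for i in range(1, len(s) - k + 1):
        h = ((h - ord(s[i - 1]) * high) * BASE + ord(s[i + k - 1])) % MOD
        yield i, h


def common_by_hash_groups(strings, k):
    """Group all windows by hash, then run the three checks on each group."""
    groups = defaultdict(list)  # hash -> list of (string_num, position)
    for n, s in enumerate(strings):
        for pos, h in window_hashes(s, k):
            groups[h].append((n, pos))

    total = len(strings)
    result = set()
    for entries in groups.values():
        # 1) the group must be at least as large as the number of strings
        if len(entries) < total:
            continue
        # 2) every string must be represented in the group
        if len({n for n, _ in entries}) < total:
            continue
        # 3) compare the real text, since distinct substrings may share a hash
        owners = defaultdict(set)  # substring -> string numbers containing it
        for n, pos in entries:
            owners[strings[n][pos:pos + k]].add(n)
        result.update(sub for sub, who in owners.items() if len(who) == total)
    return result
```

Checks 1 and 2 are cheap filters, so the O(k) text comparison in check 3 runs only for groups that could possibly qualify.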

Approach 2:

Build suffix array for every string

time O(N x M log M), space O(N x M)

Find the intersection of the suffix arrays for the first pair of strings using a merge-like approach (the suffixes are sorted), considering only the first k characters of each suffix, then continue with the next string and so on.
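A rough Python sketch of Approach 2 follows. It builds each suffix array naively by sorting suffixes, which is much slower than the O(M log M) constructions assumed in the estimate above and is used here only to keep the example short. The function names are made up for the illustration.

```python
def sorted_k_prefixes(s, k):
    """Distinct k-length prefixes of the suffixes of s, in sorted order."""
    # Naive suffix array: fine for a demo, too slow for serious use.
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    prefixes = []
    for start in suffix_array:
        if start + k > len(s):
            continue  # suffix is shorter than k characters
        prefix = s[start:start + k]
        # Equal prefixes are adjacent in suffix order, so this removes duplicates.
        if not prefixes or prefixes[-1] != prefix:
            prefixes.append(prefix)
    return prefixes


def merge_intersect(a, b):
    """Intersect two sorted lists with a merge-like two-pointer walk."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def common_via_suffix_arrays(strings, k):
    """Intersect the first pair, then fold in each remaining string."""
    if not strings:
        return []
    common = sorted_k_prefixes(strings[0], k)
    for s in strings[1:]:
        common = merge_intersect(common, sorted_k_prefixes(s, k))
    return common
```

The result comes back as a sorted list, for instance `["cde", "def"]` for the inputs `"abcdefg"`, `"xxcdefyy"`, `"cdef--bcd"` with k = 3.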

answered Sep 20 '22 04:09

MBo

I would treat each long string as a collection of overlapped short strings, so ABCDEFGHI becomes ABCDE, BCDEF, CDEFG, DEFGH, EFGHI. You can represent each short string as a pair of indexes, one specifying the long string and one the starting offset in that string (if this strikes you as naive, skip to the end).

I would then sort each collection into ascending order.

Now you can find the short strings common to the first two collections by merging the sorted lists of indexes, keeping only those from the first collection which are also present in the second collection. Check the survivors of this against the third collection, and so on. The survivors at the end correspond to those short strings which are present in all long strings.

(Alternatively you could maintain a set of pointers into each sorted list and repeatedly look to see if every pointer points at short strings with the same text, then advancing the pointer which points at the smallest short string).
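The pointer-based alternative might look something like the Python sketch below. For brevity it materializes the short strings instead of storing index pairs as suggested above, and it removes duplicates within each list before sorting; both are simplifications of mine, as is the name `common_by_pointers`.

```python
def common_by_pointers(strings, k):
    """Walk one pointer per sorted list, advancing whichever points at the smallest."""
    if not strings:
        return []
    # One sorted, duplicate-free list of k-length substrings per long string.
    lists = [sorted({s[i:i + k] for i in range(len(s) - k + 1)}) for s in strings]
    pointers = [0] * len(lists)
    common = []
    # Stop as soon as any list is exhausted: nothing further can be in all of them.
    while all(p < len(lst) for p, lst in zip(pointers, lists)):
        current = [lst[p] for p, lst in zip(pointers, lists)]
        smallest = min(current)
        if all(text == smallest for text in current):
            common.append(smallest)              # every pointer agrees
            pointers = [p + 1 for p in pointers]
        else:
            # The smallest entry cannot appear in the lists whose pointers
            # are already past it, so it is safe to skip.
            pointers[current.index(smallest)] += 1
    return common
```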

Time is O(n log n) for the initial sort, which dominates. In the worst case - e.g. when every string is AAAAAAAA..AA - there is a factor of k on top of this, because all string compares check all characters and take time k. Hopefully, there is a clever way round this with https://en.wikipedia.org/wiki/Suffix_array which allows you to sort in time O(n) rather than O(nk log n) and the https://en.wikipedia.org/wiki/LCP_array, which should allow you to skip some characters when comparing substrings from different suffix arrays.

Thinking about this again, I think the usual suffix array trick of concatenating all of the strings in question, separated by a character not found in any of them, works here. If you look at the LCP of the resulting suffix array you can split it into sections, splitting at points where the difference between adjacent suffixes occurs less than k characters in. Now each offset in any particular section starts with the same k characters. Now look at the offsets in each section and check to see if there is at least one offset from every possible starting string. If so, this k-character sequence occurs in all starting strings, but not otherwise. (There are suffix array constructions which work with arbitrarily large alphabets so you can always expand your alphabet to produce a character not in any string, if necessary).
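Here is a small Python sketch of the concatenation idea, under several assumptions of my own: the separator `"\x00"` is assumed not to occur in any input, the suffix array is built naively by sorting suffixes (far slower than a real construction), and instead of an explicit LCP array the code compares the first k characters of neighbouring suffixes, which amounts to asking whether their LCP is at least k. The function name is hypothetical.

```python
from bisect import bisect_right


def common_via_concatenation(strings, k, sep="\x00"):
    """Find k-length substrings common to all strings via one suffix array."""
    # Assumption: sep is a character that appears in none of the strings.
    assert all(sep not in s for s in strings)
    text = sep.join(strings)

    # starts[n] is the offset in text where strings[n] begins.
    starts = []
    pos = 0
    for s in strings:
        starts.append(pos)
        pos += len(s) + 1  # +1 for the separator

    # Naive suffix array, for illustration only.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])

    result = []
    section_text = None  # the k characters shared by the current section
    owners = set()       # which strings contribute an offset to the section
    for offset in suffix_array:
        n = bisect_right(starts, offset) - 1  # string this offset falls in
        if offset + k > starts[n] + len(strings[n]):
            continue  # window would run into a separator or the next string
        prefix = text[offset:offset + k]
        if prefix != section_text:  # LCP with the previous suffix is below k
            if section_text is not None and len(owners) == len(strings):
                result.append(section_text)
            section_text, owners = prefix, set()
        owners.add(n)
    if section_text is not None and len(owners) == len(strings):
        result.append(section_text)
    return result
```

Skipping windows that cross a separator keeps a section from counting text that straddles two of the original strings.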

answered Sep 19 '22 04:09

mcdowella
