Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to efficiently find identical substrings of a specified length in a collection of strings?

I have a collection S, typically containing 10-50 long strings. For illustrative purposes, suppose the length of each string ranges between 1000 and 10000 characters.

I would like to find strings of specified length k (typically in the range of 5 to 20) that are substrings of every string in S. This can obviously be done using a naive approach - enumerating every k-length substring in S[0] and checking if they exist in every other element of S.

Are there more efficient ways of approaching the problem? As far as I can tell, there are some similarities between this and the longest common subsequence problem, but my understanding of LCS is limited and I'm not sure how it could be adapted to the situation where we bound the desired common substring length to k, or if subsequence techniques can be applied to finding substrings.

like image 597
Samantha Avatar asked Sep 26 '18 03:09

Samantha


People also ask

How many substrings are in a string of length n?

Approach: The count of sub-strings of length n will always be len – n + 1 where len is the length of the given string.

How do you find common substrings?

For every character in string 1 we increment vector index of that character eg: v[s1[i]-'a']++, for every character of string 2 we check vector for the common characters if v[s2[i]-'a'] > 0 then set flag = true and v[s2[i]-'a']– such that one character of string 2 is compared with only one character of string 1.

How many substrings can be formed from a string?

The total number of substrings formed by string of length N is (N*(N+1))/2, initialise count as (N*(N+1))/2.


3 Answers

Here's one fairly simple algorithm, which should be reasonably fast.

  1. Using a rolling hash as in the Rabin-Karp string search algorithm, construct a hash table H0 of all the |S0|-k+1 length k substrings of S0. That's roughly O(|S0|) since each hash is computed in O(1) from the previous hash, but it will take longer if there are collisions or duplicate substrings. Using a better hash will help you with collisions but if there are a lot of k-length duplicate substrings in S0 then you could end up using O(k|S0|).

  2. Now use the same rolling hash on S1. This time, look each substring up in H0 and if you find it, remove it from H0 and insert it into a new table H1. Again, this should be around O(|S1|) unless you have some pathological case, like both S0 and S1 are just long repetitions of the same character. (It's also going to be suboptimal if S0 and S0 are the same string, or have lots of overlapping pieces.)

  3. Repeat step 2 for each Si, each time creating a new hash table. (At the end of each iteration of step 2, you can delete the hash table from the previous step.)

At the end, the last hash table will contain all the common k-length substrings.

The total run time should be about O(Σ|Si|) but in the worst case it could be O(kΣ|Si|). Even so, with the problem size as described, it should run in acceptable time.

like image 171
rici Avatar answered Sep 17 '22 04:09

rici


Some thoughts (N is number of strings, M is average length, K is needed substring size):

Approach 1:

Walk through all strings, computing rolling hash for k-length strings and storing these hashes in the map (store tuple {key: hash; string_num; position})

time O(NxM), space O(NxM)

Extract groups with equal hash, check step-by-step:
1) that size of group >= number of strings
2) all strings are represented in this group 3
3) thorough checking of real substrings for equality (sometimes hashes of distinct substrings might coincide)

Approach 2:

Build suffix array for every string

time O(N x MlogM) space O(N x M)

Find intersection of suffix arrays for the first string pair, using merge-like approach (suffixes are sorted), considering only part of suffixes of length k, then continue with the next string and so on

like image 31
MBo Avatar answered Sep 20 '22 04:09

MBo


I would treat each long string as a collection of overlapped short strings, so ABCDEFGHI becomes ABCDE, BCDEF, CDEFG, DEFGH, EFGHI. You can represent each short string as a pair of indexes, one specifying the long string and one the starting offset in that string (if this strikes you as naive, skip to the end).

I would then sort each collection into ascending order.

Now you can find the short strings common to the first two collection by merging the sorted lists of indexes, keeping only those from the first collection which are also present in the second collection. Check the survivors of this against the third collection, and so on and the survivors at the end correspond to those short strings which are present in all long strings.

(Alternatively you could maintain a set of pointers into each sorted list and repeatedly look to see if every pointer points at short strings with the same text, then advancing the pointer which points at the smallest short string).

Time is O(n log n) for the initial sort, which dominates. In the worst case - e.g. when every string is AAAAAAAA..AA - there is a factor of k on top of this, because all string compares check all characters and take time k. Hopefully, there is a clever way round this with https://en.wikipedia.org/wiki/Suffix_array which allows you to sort in time O(n) rather than O(nk log n) and the https://en.wikipedia.org/wiki/LCP_array, which should allow you to skip some characters when comparing substrings from different suffix arrays.

Thinking about this again, I think the usual suffix array trick of concatenating all of the strings in question, separated by a character not found in any of them, works here. If you look at the LCP of the resulting suffix array you can split it into sections, splitting at points where where the difference between suffixes occurs less than k characters in. Now each offset in any particular section starts with the same k characters. Now look at the offsets in each section and check to see if there is at least one offset from every possible starting string. If so, this k-character sequence occurs in all starting strings, but not otherwise. (There are suffix array constructions which work with arbitrarily large alphabets so you can always expand your alphabet to produce a character not in any string, if necessary).

like image 41
mcdowella Avatar answered Sep 19 '22 04:09

mcdowella