Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding out whether there exist two identical substrings one next to another

We've got a String.

ABAEABABEABE

Now we've got to check whether there exist a substring that is followed next by another substring that's exactly the same as the first one.

In this example: ABAEABABEABE
ABE is followed by ABE and that are two identical substrings.

In this example:

AAB

It would be simply A, beacuse A is followed by another A.

In this example:
ABCDEFGHIJKLMNO
There doesn't exist such a substring, so the answer would be NO.


I only managed to find an algorithm that would run in O(n^2). That is getting Hashes and its prefixes. Then for each letter we expand simply and check all the words ending on that letter. There are n letters. We need to expand it n times. So it's O(n^2). I believe there should be a O(n log n) algorithm for this problem.

Does anybody have a better idea?

like image 558
Reiji Azuma Avatar asked Jan 28 '15 09:01

Reiji Azuma


People also ask

How do you check if a string is a substring of another in C?

The function strstr returns the first occurrence of a string in another string. This means that strstr can be used to detect whether a string contains another string. In other words, whether a string is a substring of another string.

How do you find a repeated substring in a string python?

Python has a built-in function for counting the repeated substring in a given string called count(). As the name suggests, it counts the occurrence of a substring in a given string.


1 Answers

I guess you want the longest substring possible that follows this pattern.

The first thing to do is to build a Suffix tree of the input string. Using Ukkonen's algorithm, this is O(n).

Now, how does the condition you provided translate in the suffix tree? First things first, you are looking for a repeated substring [1]. Repeated substrings will appear as internal nodes of the suffix tree. The maximum number of nodes in a suffix tree built from a n-char string is 2n - 1.

You can build a Max-Heap containing such repeated substrings, using their length (number of chars). You do NOT keep substrings of length superior to N/2 (see [1]). This is O(N) where N is the number of internal nodes of the suffix tree. For any suffix tree:

0 ≤ Nn - 2

Now, you take the max out of the priority queue and process the internal node i you obtained:

  1. Let Si be the substring related to i, k = 0 and curnode = i
  2. While k < length(Si)
    1. If the key from i to a child of i is equal to Si[k], then k = k+1
    2. Else break the loop.
  3. If k == length(Si), then the substring is a match. Else, you proceed to the next substring.

Complexity summary

Let n be the length of the query string.

  • Building the suffix tree : O(n)
  • Building the Max-heap of repeated substrings: [3]
    • Identifying the repeated substrings (ie. internal nodes) and storing them in an array: O(n)
    • Heapify the array: O(n)
  • Finding the best match: O(n².log(n)) [2]

Hence the overall worst case complexity is the sum of the above and is O(n².log(n)).

Notes

I made the algorithm above... Hence it is suboptimal, if you are brave enough, you can go through this paper that describes a linear time algorithm! In any case, Suffix trees are a key to this problem so I suggest you study them thoroughly.

[1]: Warning, repeated substrings may partially overlap!

[2]: Actually, the worst case complexity is better than this very naive upper bound but I don't know how to prove it (yet?!). For example, if there were n - 2 internal nodes, that would mean that the original string consists of n occurrences of the same character. In that case, the first substring we check is a match => it's O(n.log(n)).

[3]: If we replace the heap construction by a regular sort (O(n.log(n))), the final comparison step runs in O(n²) instead of O(n².log(n)) ... Taking down the overall complexity between O(n.log(n)) (due to the sorting step) and O(n²).

like image 181
Rerito Avatar answered Oct 19 '22 03:10

Rerito