Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find repeating substring of Length N [closed]

I have to make a Java program which finds all repeating sub-strings of length n in a given String. The input is string is extremely long and a brute-force approach takes too much time.

I alread tried:
Presently I am finding each sub-string separately and checking for repetitions of that sub-string using the KMP alogrithm. This too is taking too much time.

What is a more efficient approach for this problem?

like image 355
Program_Dude Avatar asked Jan 04 '15 10:01

Program_Dude


2 Answers

1) You should look at using a suffix tree data structure.

Suffix Tree

This data structure can be built in O(N * log N) time
(I think even in O(N) time using Ukkonen's algorithm)
where N is the size/length of the input string.
It then allows for solving many (otherwise) difficult
tasks in O(M) time where M is the size/length of the pattern.

So even though I didn't try your particular problem, I am pretty sure that
if you use a suffix tree and a smart formulation of your problem, then the
problem can be solved by using a suffix tree (in reasonable O time).

2) A very good book on these (and related) subjects is this one:

Algorithms on Strings, Trees and Sequences

It's not really easy to read though unless you're well-trained in algorithms.
But OK, reading such things is the only way to get well-trained ;)

3) I suggest you have a quick look at this algorithm too.

Aho-Corasick Algorithm

Even though, I am not sure but... this one might be somewhat
off-topic with respect to your particular problem.

like image 77
peter.petrov Avatar answered Oct 01 '22 02:10

peter.petrov


I am going to take @peter.petrov's suggestion and enhance it by explaining how can one actually use a suffix tree to solve the problem:

 1. Create a suffix tree from the string, let it be `T`.
 2. Find all nodes of depth `n` in the tree, let that set of nodes be `S`. This can be done using DFS, for example.
 3. For each node `n` in `S`, do the following:
     3.1. Do a DFS, and count the number of terminals `n` leads to. Let this number be `count`
     3.2. If `count>1`, yield the substring that is related to `n` (the path from root to `n`), and `count`

Note that this algorithm treats any substring of length n and add it to the set S, and from there it search for how many times this was actually a substring by counting the number of terminals this substring leads to.

This means that the complexity of the problem is O(Creation + Traversal) - meaning, you first create the tree and then you traverse it (easy to see you don't traverse in steps 2-3 each node in the tree more than once). Since the traversal is obviously "faster" than creation of the tree - it leaves you with O(Creation), which as was pointed by @perer.petrov is O(|S|) or O(|S|log|S|) depending on your algorithm of choice.

like image 34
amit Avatar answered Oct 01 '22 02:10

amit