I have to make a Java program which finds all repeating sub-strings of length n in a given String. The input is string is extremely long and a brute-force approach takes too much time.
I alread tried:
Presently I am finding each sub-string separately and checking for repetitions of that sub-string using the KMP alogrithm. This too is taking too much time.
What is a more efficient approach for this problem?
1) You should look at using a suffix tree data structure.
Suffix Tree
This data structure can be built in O(N * log N) time
(I think even in O(N) time using Ukkonen's algorithm)
where N is the size/length of the input string.
It then allows for solving many (otherwise) difficult
tasks in O(M) time where M is the size/length of the pattern.
So even though I didn't try your particular problem, I am pretty sure that
if you use a suffix tree and a smart formulation of your problem, then the
problem can be solved by using a suffix tree (in reasonable O time).
2) A very good book on these (and related) subjects is this one:
Algorithms on Strings, Trees and Sequences
It's not really easy to read though unless you're well-trained in algorithms.
But OK, reading such things is the only way to get well-trained ;)
3) I suggest you have a quick look at this algorithm too.
Aho-Corasick Algorithm
Even though, I am not sure but... this one might be somewhat
off-topic with respect to your particular problem.
I am going to take @peter.petrov's suggestion and enhance it by explaining how can one actually use a suffix tree to solve the problem:
1. Create a suffix tree from the string, let it be `T`.
2. Find all nodes of depth `n` in the tree, let that set of nodes be `S`. This can be done using DFS, for example.
3. For each node `n` in `S`, do the following:
3.1. Do a DFS, and count the number of terminals `n` leads to. Let this number be `count`
3.2. If `count>1`, yield the substring that is related to `n` (the path from root to `n`), and `count`
Note that this algorithm treats any substring of length n
and add it to the set S
, and from there it search for how many times this was actually a substring by counting the number of terminals this substring leads to.
This means that the complexity of the problem is O(Creation + Traversal)
- meaning, you first create the tree and then you traverse it (easy to see you don't traverse in steps 2-3 each node in the tree more than once). Since the traversal is obviously "faster" than creation of the tree - it leaves you with O(Creation)
, which as was pointed by @perer.petrov is O(|S|)
or O(|S|log|S|)
depending on your algorithm of choice.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With