Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimizing construction of a trie over all substrings

I am solving a trie related problem. There is a set of strings S. I have to create a trie over all substrings for each string in S. I am using the following routine:

String strings[] = { ... }; // array containing all strings
for(int i = 0; i < strings.length; i++) {
    String w = strings[i];
    for (int j = 0; j < w.length(); j++) {
        for (int k = j + 1; k <= w.length(); k++) {
            trie.insert(w.substring(j, k));
        }
    }
}

I am using the trie implementation provided here. However, I am wondering if there are certain optimizations which can be done in order to reduce the complexity of creating trie over all substrings?

Why do I need this? Because I am trying to solve this problem.

like image 999
Bhoot Avatar asked Jul 01 '15 16:07

Bhoot


2 Answers

If we have N words, each with maximum length L, your algorithm will take O(N*L^3) (supposing that adding to trie is linear with length of adding word). However, the size of the resulting trie (number of nodes) is at most O(N*L^2), so it seems you are wasting time and you could do better.

And indeed you can, but you have to pull a few tricks from you sleeve. Also, you will no longer need the trie.

  1. .substring() in constant time

In Java 7, each String had a backing char[] array as well as starting position and length. This allowed the .substring() method to run in constant time, since String is immutable class. New String object with same backing char[] array was created, only with different start position and length.

You will need to extend this a bit, to support adding at the end of the string, by increasing the length. Always create a new string object, but leave the backing array same.

  1. Recompute hash in constant time after appending single character

Again, let me use Java's hashCode() function for String:

int hash = 0;
for (int i = 0; i < data.length; i++) {
    hash = 31 * hash + data[i];
} // data is the backing array

Now, how will the hash change after adding a single character at the end of the word? Easy, just add it's value (ASCII code) multiplied by 31^length. You can keep powers of 31 in some separate table, other primes can be used as well.

  1. Store all substring in single HashMap

With using tricks 1 and 2, you can generate all substrings in time O(N*L^2), which is the total number of substrings. Just always start with string of length one and add one character at a time. Put all your strings into a single HashMap, to reduce duplicities.

(You can skip 2 and 3 and discard duplicities when/after sorting, perhaps it will be even faster.)

  1. Sort your substrings and you are good to go.

Well, when I got to point 4, I realized my plan wouldn't work, because in sorting you need to compare strings, and that can take O(L) time. I came up with several attempts to solve it, among them bucket sorting, but none would be faster than original O(N*L^3)

I will just this answer here in case it inspires someone.


In case you don't know Aho-Corasic algorithm, take look into that, it could have some use for your problem.

like image 135
kajacx Avatar answered Oct 05 '22 22:10

kajacx


What you need may be suffix automaton. It costs only O(n) time and can recognize all substrings.

Suffix array can also solve this problems.

These two algorithms can solve most string problems, and they are really hard to learn. After you learn those you will solve it.

like image 28
newSolar Avatar answered Oct 05 '22 23:10

newSolar