
Build trie faster

I'm making a mobile app that needs thousands of fast string lookups and prefix checks. To speed this up, I built a trie from my word list, which has about 180,000 words.

Everything's great, but the only problem is that building this huge trie (it has about 400,000 nodes) takes about 10 seconds currently on my phone, which is really slow.

Here's the code that builds the trie.

public SimpleTrie makeTrie(String file) throws Exception {
    String line;
    SimpleTrie trie = new SimpleTrie();

    BufferedReader br = new BufferedReader(new FileReader(file));
    while( (line = br.readLine()) != null) {
        trie.insert(line);
    }
    br.close();

    return trie;
}

The insert method, which runs in O(length of key):

public void insert(String key) {
    TrieNode crawler = root;
    for(int level=0 ; level < key.length() ; level++) {
        int index = key.charAt(level) - 'A';
        if(crawler.children[index] == null) {
            crawler.children[index] = getNode();
        }
        crawler = crawler.children[index];
    }
    crawler.valid = true;
}

I'm looking for intuitive methods to build the trie faster. Maybe I could build the trie just once on my laptop, store it on disk somehow, and load it from a file on the phone? But I don't know how to implement this.

Or are there any other prefix data structures which will take less time to build, but have similar lookup time complexity?

Any suggestions are appreciated. Thanks in advance.

EDIT

Someone suggested using Java Serialization. I tried it, but it was very slow with this code:

public void serializeTrie(SimpleTrie trie, String file) {
    try {
        ObjectOutput out = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(file)));
        out.writeObject(trie);
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

public SimpleTrie deserializeTrie(String file) {
    try {
        ObjectInput in = new ObjectInputStream(new BufferedInputStream(new FileInputStream(file)));
        SimpleTrie trie = (SimpleTrie)in.readObject();
        in.close();
        return trie;
    } catch (IOException | ClassNotFoundException e) {
        e.printStackTrace();
        return null;
    }
}

Can the code above be made faster?

My trie: http://pastebin.com/QkFisi09

Word list: http://www.isc.ro/lists/twl06.zip

Android IDE used to run code: http://play.google.com/store/apps/details?id=com.jimmychen.app.sand

asked Sep 23 '13 16:09 by Bruce


3 Answers

Double-array tries are very fast to save and load because all of their data is stored in linear arrays. Lookups are also very fast, but insertions can be costly. I bet there is a Java implementation somewhere.

Also, if your data is static (i.e. you don't update it on the phone), consider a DAFSA for your task. It is one of the most efficient data structures for storing words (it should be better than "standard" tries and radix tries for both size and speed, better than succinct tries for speed, and often better than succinct tries for size). There is a good C++ implementation: dawgdic. You can use it to build a DAFSA from the command line and then use a Java reader for the resulting data structure (example implementation is here).

answered Nov 03 '22 17:11 by Mikhail Korobov


You could store your trie as an array of nodes, with references to child nodes replaced by array indices. Your root node would be the first element. That way, you could easily store your trie in, and load it from, a simple binary or text format.

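public class SimpleTrie {
    public static class TrieNode {
        boolean valid;
        // Indices into nodes[] instead of object references. 0 means "no child",
        // which is unambiguous because the root (index 0) is never anyone's child.
        // 26 slots assumes keys are 'A'..'Z', as in the question's insert().
        int[] children = new int[26];
    }
    private TrieNode[] nodes;       // nodes[0] is the root; allocate with enough capacity up front
    private int numberOfNodes = 1;  // counts the root

    // Creates a node in the next free slot and returns its index.
    private int getNode() {
        nodes[numberOfNodes] = new TrieNode();
        return numberOfNodes++;
    }
}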
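To make the store/load idea concrete, here is a minimal sketch that goes one step further and keeps the whole trie in two primitive arrays, so saving and loading are plain sequential reads and writes. The class name FlatTrie and all of its methods are hypothetical (they are not part of the question's SimpleTrie), and the sketch assumes keys contain only 'A'..'Z', as in the question's insert().

```java
import java.io.*;
import java.util.Arrays;

/** Hypothetical flat trie: node i owns slots [i*26, i*26+26) of children. */
final class FlatTrie {
    private static final int ALPHABET = 26;  // assumption: keys use only 'A'..'Z'
    private int[] children = new int[ALPHABET * 1024]; // 0 = no child (root is node 0)
    private boolean[] valid = new boolean[1024];
    private int size = 1;                    // node 0 is the root

    public void insert(String key) {
        int node = 0;
        for (int i = 0; i < key.length(); i++) {
            int slot = node * ALPHABET + (key.charAt(i) - 'A');
            if (children[slot] == 0) {
                if (size == valid.length) grow();
                children[slot] = size++;     // allocate the next free node index
            }
            node = children[slot];
        }
        valid[node] = true;
    }

    private void grow() {                    // double the capacity of both arrays
        valid = Arrays.copyOf(valid, valid.length * 2);
        children = Arrays.copyOf(children, children.length * 2);
    }

    /** Returns the node reached by following s, or -1 if the path is missing. */
    private int walk(String s) {
        int node = 0;
        for (int i = 0; i < s.length(); i++) {
            node = children[node * ALPHABET + (s.charAt(i) - 'A')];
            if (node == 0) return -1;
        }
        return node;
    }

    public boolean contains(String key) { int n = walk(key); return n >= 0 && valid[n]; }
    public boolean hasPrefix(String p)  { return walk(p) >= 0; }

    /** Writes the node count, then the child table, then the valid flags. */
    public void save(OutputStream os) throws IOException {
        DataOutputStream out = new DataOutputStream(new BufferedOutputStream(os));
        out.writeInt(size);
        for (int i = 0; i < size * ALPHABET; i++) out.writeInt(children[i]);
        for (int i = 0; i < size; i++) out.writeBoolean(valid[i]);
        out.flush();
    }

    /** Reads back exactly what save() wrote. */
    public static FlatTrie load(InputStream is) throws IOException {
        DataInputStream in = new DataInputStream(new BufferedInputStream(is));
        FlatTrie t = new FlatTrie();
        t.size = in.readInt();
        t.children = new int[t.size * ALPHABET];
        t.valid = new boolean[t.size];
        for (int i = 0; i < t.children.length; i++) t.children[i] = in.readInt();
        for (int i = 0; i < t.size; i++) t.valid[i] = in.readBoolean();
        return t;
    }
}
```

The intended use would be to call insert() for every word and save() once on the laptop, ship the resulting file with the app, and call load() on the phone. Loading allocates two arrays instead of one object per node, which is the point of the layout, but whether it is fast enough needs to be measured on the device. Note the size trade-off: with about 400,000 nodes the child table alone is 400,000 × 26 × 4 bytes, roughly 41.6 MB, so a more compact encoding may be necessary on a phone.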
answered Nov 03 '22 15:11 by el.pescado - нет войне


Just build a large String[] and sort it. Then you can use binary search to find the location of a String. You can also do a query based on prefixes without too much work.

Prefix look-up example:

Compare method:

// Returns 0 if string starts with prefix, -1 if prefix sorts before string,
// and 1 if prefix sorts after string.
private static int compare(String string, String prefix) {
    int n = Math.min(string.length(), prefix.length());
    for (int i=0; i<n; i++) {
        char s = string.charAt(i);
        char p = prefix.charAt(i);
        if (s!=p) {
            if (p<s) {
                // prefix is before string
                return -1;
            }
            // prefix is after string
            return 1;
        }
    }
    // string is shorter than prefix, so it sorts before every string with that prefix
    if (string.length()<prefix.length()) return 1;
    return 0;
}

Finds an occurrence of the prefix in the sorted array and returns its index (Integer.MAX_VALUE means not found):

private static int recursiveFind(String[] strings, String prefix, int start, int end) {
    if (start > end) return Integer.MAX_VALUE; // empty range: not found

    int middle = start + ((end - start) / 2);
    int comp = compare(strings[middle], prefix);
    if (comp==0)
        return middle;
    if (comp>0)
        // prefix sorts after the middle value, so search the upper half
        return recursiveFind(strings, prefix, middle + 1, end);
    return recursiveFind(strings, prefix, start, middle - 1);
}

Takes a sorted String array and a prefix, prints every string that starts with the prefix, and returns whether any was found:

private static boolean testPrefix(String[] strings, String prefix) {
    int i = recursiveFind(strings, prefix, 0, strings.length-1);
    if (i==Integer.MAX_VALUE || i==Integer.MIN_VALUE) {
        // not found
        return false;
    }
    // Found an occurrence, now search up and down for other occurrences
    int up = i+1;
    int down = i;
    while (down>=0) {
        String string = strings[down];
        if (compare(string,prefix)==0) {
            System.out.println(string);
        } else {
            break;
        }
        down--;
    }
    while (up<strings.length) {
        String string = strings[up];
        if (compare(string,prefix)==0) {
            System.out.println(string);
        } else {
            break;
        }
        up++;
    }
    return true;
}
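If only a yes/no prefix check or a count is needed, rather than printing every match, a lower-bound binary search is enough. The following is a sketch under stated assumptions, not part of the answer above: the class PrefixSearch and its methods are hypothetical names, the array must already be sorted in natural String order (for example with Arrays.sort), and countPrefix assumes no word contains the character '\uFFFF'.

```java
import java.util.Arrays;

final class PrefixSearch {
    /** Index of the first element >= key in a sorted array (a lower bound). */
    static int lowerBound(String[] sorted, String key) {
        int lo = 0, hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid].compareTo(key) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    /** True if the exact word is present. */
    static boolean contains(String[] sorted, String word) {
        return Arrays.binarySearch(sorted, word) >= 0;
    }

    /** True if at least one word starts with prefix. */
    static boolean hasPrefix(String[] sorted, String prefix) {
        int i = lowerBound(sorted, prefix);
        return i < sorted.length && sorted[i].startsWith(prefix);
    }

    /** Number of words that start with prefix (assumes no word contains '\uFFFF'). */
    static int countPrefix(String[] sorted, String prefix) {
        int from = lowerBound(sorted, prefix);
        int to = lowerBound(sorted, prefix + Character.MAX_VALUE);
        return to - from;
    }
}
```

Each call costs O(log n) string comparisons, so with about 180,000 words that is at most 18 comparisons per lookup. Loading is just reading the word list into an array, and if the file is stored pre-sorted, the sort step on the phone can be skipped.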
answered Nov 03 '22 17:11 by Justin