Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Erlang: What is most-wrong with this trie implementation?

Tags:

erlang

trie

Over the holidays, my family loves to play Boggle. Problem is, I'm terrible at Boggle. So I did what any good programmer would do: wrote a program to play for me.

At the core of the algorithm is a simple prefix trie, where each node is a dict of references to the next letters.

This is the trie:add implementation:

add([], Trie) ->
    dict:store(stop, true, Trie);

add([Ch|Rest], Trie) ->
    % setdefault(Key, Default, Dict) ->
    %     case dict:find(Key, Dict) of
    %         { ok, Val } -> { Dict, Val }
    %         error -> { dict:new(), Default }
    %     end.
    { NewTrie, SubTrie } = setdefault(Ch, dict:new(), Trie),
    NewSubTrie = add(Rest, SubTrie),
    dict:store(Ch, NewSubTrie, NewTrie).

And you can see the rest, along with an example of how it's used (at the bottom), here:

http://gist.github.com/263513

Now, this being my first serious program in Erlang, I know there are probably a bunch of things wrong with it… But my immediate concern is that it uses 800 megabytes of RAM.

So, what am I doing most-wrong? And how might I make it a bit less-wrong?

like image 666
David Wolever Avatar asked Dec 26 '09 18:12

David Wolever


2 Answers

You could implement this functionality by simply storing the words in an ets table:

% create table; add words
> ets:new(words, [named_table, set]).
> ets:insert(words, [{"zed"}]).  
> ets:insert(words, [{"zebra"}]).    

% check if word exists
> ets:lookup(words, "zed").          
[{"zed"}]

% check if "ze" has a continuation among the words
78> ets:match(words, {"ze" ++ '$1'}).
[["d"],["bra"]]

If trie is a must, but you can live with a non-functional approach, then you can try digraphs, as Paul already suggested.

If you want to stay functional, you might save some bytes of memory by using structures using less memory, for example proplists, or records, such as -record(node, {a,b,....,x,y,z}).

like image 158
Zed Avatar answered Sep 28 '22 02:09

Zed


I don't remember how much memory a dict takes, but let's estimate. You have 2.5e6 characters and 2e5 words. If your trie had no sharing at all, that would take 2.7e6 associations in the dicts (one for each character and each 'stop' symbol). A simple purely-functional dict representation would maybe 4 words per association -- it could be less, but I'm trying to get an upper bound. On a 64-bit machine, that'd take 8*4*2.7 million bytes, or 86 megabytes. That's only a tenth of your 800M, so something's surely wrong here.

Update: dict.erl represents dicts with a hashtable; this implies lots of overhead when you have a lot of very small dicts, as you do. I'd try changing your code to use the proplists module, which ought to match my calculations above.

like image 26
Darius Bacon Avatar answered Sep 28 '22 02:09

Darius Bacon