Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What would be a sensible way to implement a Trie in .NET?

I get the concept behind a trie. But I get a little confused when it comes to implementation.

The most obvious way I could think to structure a Trie type would be to have a Trie maintain an internal Dictionary<char, Trie>. I have in fact written one this way, and it works, but... this seems like overkill. My impression is that a trie should be lightweight, and having a separate Dictionary<char, Trie> for every node does not seem very lightweight to me.

Is there a more appropriate way to implement this structure that I'm missing?


UPDATE: OK! Based on the very helpful input from Jon and leppie, this is what I've come up with so far:

(1) I have the Trie type, which has a private _nodes member of type Trie.INodeCollection.

(2) The Trie.INodeCollection interface has the following members:

interface INodeCollection
{
    bool TryGetNode(char key, out Trie node);
    INodeCollection Add(char key, Trie node);
    IEnumerable<Trie> GetNodes();
}

(3) There are three implementations of this interface:

class SingleNode : INodeCollection
{
    internal readonly char _key;
    internal readonly Trie _trie;

    public SingleNode(char key, Trie trie)
    { /*...*/ }

    // Add returns a SmallNodeCollection.
}

class SmallNodeCollection : INodeCollection
{
    const int MaximumSize = 8; // ?

    internal readonly List<KeyValuePair<char, Trie>> _nodes;

    public SmallNodeCollection(SingleNode node, char key, Trie trie)
    { /*...*/ }

    // Add adds to the list and returns the current instance until MaximumSize,
    // after which point it returns a LargeNodeCollection.
}

class LargeNodeCollection : INodeCollection
{
    private readonly Dictionary<char, Trie> _nodes;

    public LargeNodeCollection(SmallNodeCollection nodes, char key, Trie trie)
    { /*...*/ }

    // Add adds to the dictionary and returns the current instance.
}

(4) When a Trie is first constructed, its _nodes member is null. The first call to Add creates a SingleNode, and subsequent calls to Add go from there, according to the steps described above.

Does this make sense? This feels like an improvement in the sense that it somewhat reduces the "bulkiness" of a Trie (nodes are no longer full-blown Dictionary<char, Trie> objects until they have a sufficient number of children). However, it has also become significantly more complex. Is it too convoluted? Have I taken a complicated route to achieve something that should've been straightforward?

like image 447
Dan Tao Avatar asked Sep 08 '10 06:09

Dan Tao


4 Answers

Well, you need each node to have something which effectively implements IDictionary<char, Trie>. You could write your own custom implementation which varies its internal structure based on how many subnodes it has:

  • For a single subnode, use just a char and a Trie
  • For a small number, use a List<Tuple<char, Trie>> or a LinkedList<Tuple<char,Trie>>
  • For a large number, use a Dictionary<char, Trie>

(Having just seen leppie's answer, this is the kind of hybrid approach he talks about, I believe.)

like image 155
Jon Skeet Avatar answered Oct 23 '22 11:10

Jon Skeet


If your characters are from a limited set (e.g. only uppercase latin alphabet), then you can store a 26 element array, and each lookup is just

Trie next = store[c-'A']

where c is the current lookup character.

like image 30
Damien_The_Unbeliever Avatar answered Oct 23 '22 13:10

Damien_The_Unbeliever


Implementing it as a Dictionary, in my mind, isn't implementing a Trie - that's implementing a Dictionary of Dictionaries.

When I've implemented a trie I've done it the same way as suggested by Damien_The_Unbeliever (+1 there):

public class TrieNode
{
  TrieNode[] Children = new TrieNode[no_of_chars];
}

This ideally requires then that your trie will only support a limited subset of characters indicated by no_of_chars and that you can map input characters to output indices. E.g. if supporting A-Z then you would naturally map A to 0 and Z to 25.

When you then need to add/remove/check existence of a node you then do something like this:

public TrieNode GetNode(char c)
{
  //mapping function - could be a lookup table, or simple arithmetic
  int index = GetIndex(c);
  //TODO: deal with the situation where 'c' is not supported by the map
  return Children[index];
} 

In real cases I've seen this optimized so that AddNode, for example, would take a ref TrieNode so that the node can be newed on demand and automatically placed into the parent TrieNode's Children in the correct place.

You could also use a Ternary Search Tree instead as the memory overhead for a trie can be pretty crazy (especially if you intend to support all 32k of unicode characters!) and the TST performance is rather impressive (and also supports prefix & wildcard searching as well as hamming searches). Equally, TSTs can natively support all unicode characters without having to do any mapping; since they work on a greater-than/less-than/equals operation instead of an absolute index value.

I took the code from here and adapted it slightly (it was written before generics).

I think you'll be pleasantly surprised by TSTs; once I had one implemented I steered away from Tries altogether.

The only tricky thing is keeeping the TST balanced; an issue you don't have with Tries.

like image 27
Andras Zoltan Avatar answered Oct 23 '22 13:10

Andras Zoltan


There are a few ways, but using a singly link list is probably the simplest and lightweight.

I would do some tests to see the amount of child nodes each node has. If not much (say 20 or less), the link list approach should be faster than a hashtable. You could also do a hybrid approach depending on the amount of child nodes.

like image 2
leppie Avatar answered Oct 23 '22 13:10

leppie