Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C# Binary Trees and Dictionaries

I'm struggling with the concept of when to use binary search trees and when to use dictionaries.

In my application I did a little experiment which used the C5 library TreeDictionary (which I believe is a red-black binary search tree), and the C# dictionary. The dictionary was always faster at add/find operations and also always used less memory space. For example, at 16809 <int, float> entries, the dictionary used 342 KiB whilst the tree used 723 KiB.

I thought that BST's were supposed to be more memory efficient, but it seems that one node of the tree requires more bytes than one entry in a dictionary. What gives? Is there a point at where BST's are better than dictionaries?

Also, as a side question, does anyone know if there is a faster + more memory efficient data structure for storing <int, float> pairs for dictionary type access than either of the mentioned structures?

like image 288
Projectile Fish Avatar asked Jan 28 '10 01:01

Projectile Fish


2 Answers

I thought that BST's were supposed to be more memory efficient, but it seems that one node of the tree requires more bytes than one entry in a dictionary. What gives? Is there a point at where BST's are better than dictionaries?

I've personally never heard of such a principle. Even still, its only a general principle, not a categorical fact etched in the fabric of the universe.

Generally, Dictionaries are really just a fancy wrapper around an array of linked lists. You insert into the dictionary something like:

LinkedList<Tuple<TKey, TValue>> list =
    internalArray[internalArray % key.GetHashCode()];
if (list.Exists(x => x.Key == key))
    throw new Exception("Key already exists");
list.AddLast(Tuple.Create(key, value));

So its nearly O(1) operation. The dictionary uses O(internalArray.Length + n) memory, where n is number of items in the collection.

In general BSTs can be implemented as:

  • linked-lists, which use O(n) space, where n is the number items in the collection.
  • arrays, which use O(2h - n) space where h is the height of the tree and n is the number of items in the collection.
    • Since red-black trees have a bounded height of O(1.44 * n), an array implementation should have a bounded memory usage of about O(21.44n - n)

Odds are, the C5 TreeDictionary is implemented using arrays, which is probably responsible for the wasted space.

What gives? Is there a point at where BST's are better than dictionaries?

Dictionaries have some undesirable properties:

  • There may not be enough continugous blocks of memory to hold your dictionary, even if its memory requirements are much less than than the total available RAM.

  • Evaluating the hash function can take an arbitrarily long length of time. Strings, for example, use Reflector to examine the System.String.GetHashCode method -- you'll notice hashing a string always takes O(n) time, which means it can take considerable time for very long strings. On the hand, comparing strings for inequality almost always faster than hashing, since it may require looking at just the first few chars. Its wholly possible for tree inserts to be faster than dictionary inserts if hash code evaluation takes too long.

    • Int32's GetHashCode method is literally just return this, so you'd be hardpressed to find a case where a hashtable with int keys is slower than a tree dictionary.

RB Trees have some desirable properties:

  • You can find/remove the Min and Max elements in O(log n) time, compared to O(n) time using a dictionary.

  • If a tree is implemented as linked list rather than an array, the tree is usually more space efficient than a dictionary.

  • Likewise, its ridiculous easy to write immutable versions of trees which support insert/lookup/delete in O(log n) time. Dictionaries do not adapt well to immutability, since you need to copy the entire internal array for every operation (actually, I have seen some array-based implementations of immutable finger trees, a kind of general purpose dictionary data structure, but the implementation is very complex).

  • You can traverse all the elements in a tree in sorted order in constant space and O(n) time, whereas you'd need to dump a hash table into an array and sort it to get the same effect.

So, the choice of data structure really depends on what properties you need. If you just want an unordered bag and can guarantee that your hash function evaluate quickly, go with a .Net Dictionary. If you need an ordered bag or have a slow running hash function, go with TreeDictionary.

like image 119
Juliet Avatar answered Oct 02 '22 12:10

Juliet


It does make sense that a tree node would require more storage than a dictionary entry. A binary tree node needs to store the value and both the left and right subtrees. The generic Dictionary<TKey, TValue> is implemented as a hash table which - I'm assuming - either uses a linked list for each bucket (value plus one pointer/reference) or some sort of remapping (just the value). I'd have to have a peek in Reflector to be sure, but for the purpose of this question I don't think it's that important.

The sparser the hash table, the less efficient in terms of storage/memory. If you create a hash table (dictionary) and initialize its capacity to 1 million, and only fill it with 10,000 elements, then I'm pretty sure it would eat up a lot more memory than a BST with 10,000 nodes.

Still, I wouldn't worry about any of this if the amount of nodes/keys is only in the thousands. That's going to be measured in the kilobytes, compared to gigabytes of physical RAM.


If the question is "why would you want to use a binary tree instead of a hash table?" Then the best answer IMO is that binary trees are ordered whereas hash tables are not. You can only search a hash table for keys that are exactly equal to something; with a tree, you can search for a range of values, nearest value, etc. This is a pretty important distinction if you're creating an index or something similar.

like image 39
Aaronaught Avatar answered Oct 02 '22 13:10

Aaronaught