Critique this C# Hashmap Implementation?

Question

I wrote a hashmap in C# as a self-study exercise. I wanted to implement chaining as a collision handling technique. At first I thought I'd simply use GetHashCode as my hashing algorithm, but I quickly found that using the numbers returned by GetHashCode would not always be viable (the size of the int causes an out-of-memory error if you want to index an array by the number, and the numbers can be negative :(). So, I came up with a kludgey method of narrowing the numbers (see MyGetHashCode).

Does anyone have any pointers/tips/criticism for this implementation (of the hash function and in general)? Thanks in advance!

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.VisualStudio.TestTools.UnitTesting;
namespace HashMap
{
    class Program
    {

        public class MyKVP<T, K>
        {
            public T Key { get; set; }
            public K Value { get; set; }
            public MyKVP(T key, K value)
            {
                Key = key;
                Value = value;
            }
        }


        public class MyHashMap<T, K> : IEnumerable<MyKVP<T,K>>
            where T:IComparable
        {

            private const int map_size = 5000;
            private List<MyKVP<T,K>>[] storage;
            public MyHashMap()
            {
                storage = new List<MyKVP<T,K>>[map_size];
            }

            System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
            {
                return GetEnumerator();
            }
            public IEnumerator<MyKVP<T, K>> GetEnumerator()
            {
                foreach (List<MyKVP<T, K>> kvpList in storage)
                {
                    if (kvpList != null)
                    {
                        foreach (MyKVP<T, K> kvp in kvpList)
                        {
                            yield return kvp;
                        }
                    }
                }
            }


            private int MyGetHashCode(T key)
            {
                int i = key.GetHashCode();
                if (i<0) i=i*-1;
                return i / 10000;
            }

            public void Add(T key, K data)
            {
                int value = MyGetHashCode(key);

                SizeIfNeeded(value);

                //is this spot in the hashmap null?
                if (storage[value] == null)
                {
                    //create a new chain
                    storage[value] = new List<MyKVP<T, K>>();
                    storage[value].Add(new MyKVP<T, K>(key, data));
                }
                else
                { 
                    //is this spot taken?
                    MyKVP<T, K> myKvp = Find(value, key);
                    if (myKvp != null) //key exists, throw
                    {
                        throw new Exception("This key exists. no soup for you.");
                    }

                    //if we didn't throw, then add us
                    storage[value].Add(new MyKVP<T, K>(key, data));
                }

            }

            private MyKVP<T, K> Find(int value, T key)
            {
                foreach (MyKVP<T, K> kvp in storage[value])
                {
                    if (kvp.Key.CompareTo(key) == 0)
                    {
                        return kvp;
                    }
                }

                return null;
            }

            private void SizeIfNeeded(int value)
            {
                if (value >= storage.Length)
                {
                    List<MyKVP<T, K>>[] temp = storage;
                    storage = new List<MyKVP<T, K>>[value+1];
                    Array.Copy(temp, storage, temp.Length);
                }
            }

            public K this[T key]
            {

                get 
                {
                    int value = MyGetHashCode(key);
                    if (value > storage.Length) { throw new IndexOutOfRangeException("Key does not exist."); }
                    MyKVP<T, K> myKvp = Find(value, key);
                    if (myKvp == null) throw new Exception("key does not exist");
                    return myKvp.Value;
                }
                set 
                {
                    Add(key, value);
                }
            }


            public void Remove(T key)
            {
                int value = MyGetHashCode(key);
                if (value > storage.Length) { throw new IndexOutOfRangeException("Key does not exist."); }
                if (storage[value] == null) { throw new IndexOutOfRangeException("Key does not exist."); }

                //loop through each kvp at this hash location
                MyKVP<T, K> myKvp = Find(value, key);
                if (myKvp != null)
                {
                    storage[value].Remove(myKvp);
                }
            }
        }

        static void Main(string[] args)
        {
            MyHashMap<string, int> myHashMap = new MyHashMap<string, int>();
            myHashMap.Add("joe", 1);
            myHashMap.Add("mike", 2);
            myHashMap.Add("adam", 3);
            myHashMap.Add("dad", 4);

            Assert.AreEqual(1, myHashMap["joe"]);
            Assert.AreEqual(4, myHashMap["dad"]);
            Assert.AreEqual(2, myHashMap["mike"]);
            Assert.AreEqual(3, myHashMap["adam"]);

            myHashMap.Remove("joe");

            try 
            {
                if (myHashMap["joe"] == 3) { }; //should throw 
            }
            catch (Exception) 
            {
                try { myHashMap.Add("mike",1); }
                catch (Exception) {

                    foreach (MyKVP<string, int> kvp in myHashMap)
                    { 
                        Console.WriteLine(kvp.Key + " " + kvp.Value.ToString());
                    }


                    return;
                }

            }

            throw new Exception("fail");
        }
    }
}

Jon Hanna · Accepted Answer

Your hash method has a fixed range. This means that a single item could cause 214748 buckets to be created (if its hash code rehashed to 214747). A more commonly used (and almost always better) approach is to start with an initial size that is either known (from knowledge of the domain) to be big enough for all values, or to start small and have the hashmap resize itself as appropriate. With re-probing, the obvious measure of a need to resize is how much re-probing was needed. With chaining, as you are experimenting with here, you'll want to keep both average and maximum chain sizes down. This keeps down your worst-case lookup time, and hence keeps your average lookup time closer to the best-case O(1).
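A minimal sketch of the "start small and resize" idea with chaining. The class name ResizingMap, the initial size of 8, the 0.75 items-per-bucket threshold and the doubling policy are all illustrative assumptions, not something prescribed above:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: a chained table that starts small and grows.
public class ResizingMap<TKey, TValue>
{
    private List<KeyValuePair<TKey, TValue>>[] buckets =
        new List<KeyValuePair<TKey, TValue>>[8];
    private int count;

    public int Count { get { return count; } }
    public int BucketCount { get { return buckets.Length; } }

    // Map any hash code (including negative ones) into [0, bucketCount).
    private static int IndexFor(TKey key, int bucketCount)
    {
        return (key.GetHashCode() & 0x7FFFFFFF) % bucketCount;
    }

    public void Add(TKey key, TValue value)
    {
        if (key == null) throw new ArgumentNullException("key");
        // Assumed policy: double the table once it would hold more than
        // 0.75 items per bucket on average.
        if (count + 1 > buckets.Length * 3 / 4) Resize(buckets.Length * 2);

        int index = IndexFor(key, buckets.Length);
        if (buckets[index] == null)
            buckets[index] = new List<KeyValuePair<TKey, TValue>>();
        foreach (var pair in buckets[index])
        {
            if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
                throw new ArgumentException("An item with the same key already exists.");
        }
        buckets[index].Add(new KeyValuePair<TKey, TValue>(key, value));
        count++;
    }

    // Rehash every entry into the larger array; an entry's index depends
    // on the table size, so a plain Array.Copy would not be enough.
    private void Resize(int newSize)
    {
        var old = buckets;
        buckets = new List<KeyValuePair<TKey, TValue>>[newSize];
        foreach (var chain in old)
        {
            if (chain == null) continue;
            foreach (var pair in chain)
            {
                int index = IndexFor(pair.Key, newSize);
                if (buckets[index] == null)
                    buckets[index] = new List<KeyValuePair<TKey, TValue>>();
                buckets[index].Add(pair);
            }
        }
    }
}
```

With these assumed numbers the table stays at 8 buckets for the first six items and doubles to 16 when the seventh is added. Tracking the longest chain instead of the average load would be another reasonable trigger.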

The two most common approaches to such hashing (and hence to initial table size) are to use either prime numbers or powers of two. The former is considered (though there is some contention on the point) to offer better distribution of keys, while the latter allows for faster computation (both cases do a modulo on the input hash, but with a number known to be a power of 2, the modulo can be done quickly as a binary-and operation). Another advantage of using a power of two when you are chaining is that it's possible to test a chain to see whether resizing the hash would actually cause that chain to be split (if you have an 8-value table and there's a chain whose hashes are all either 17, 1 or 33, then doubling the table size would still leave them in the same chain, but quadrupling it would redistribute them).
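The power-of-two case can be sketched as follows. PowerOfTwoBuckets, IndexFor and WouldSplit are hypothetical helper names used only for illustration, and the code assumes the table size really is a power of two:

```csharp
using System;

public static class PowerOfTwoBuckets
{
    // For a power-of-two table size, masking with (size - 1) gives the
    // same result as hash % size for non-negative hashes, and it also
    // yields a non-negative index when the hash is negative.
    public static int IndexFor(int hash, int tableSize)
    {
        return hash & (tableSize - 1);
    }

    // True if, at the new table size, the given hashes (all currently in
    // one chain) would land in more than one bucket.
    public static bool WouldSplit(int[] chainHashes, int newSize)
    {
        int first = IndexFor(chainHashes[0], newSize);
        foreach (int hash in chainHashes)
        {
            if (IndexFor(hash, newSize) != first) return true;
        }
        return false;
    }
}
```

For the chain with hashes 17, 1 and 33, all three map to bucket 1 at sizes 8 and 16, so WouldSplit returns false for 16. At size 32, 17 maps to bucket 17 while 1 and 33 stay in bucket 1, so WouldSplit returns true.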

You don't have a method offering replace semantics, which is usual with .NET dictionary types (where adding will error if there's already an item with that key, but assigning to an index won't).

Your error on a retrieval that would try to go beyond the number of buckets will make no sense to the user, who doesn't care whether the bucket existed or not, only whether the key did (they need not know how your implementation works at all). Both cases where a key isn't found should throw the same error (System.Collections.Generic.KeyNotFoundException has precisely the right semantics, so you could reuse that).
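A rough sketch of the last two points together: Add that refuses duplicates, an indexer setter with replace semantics, and a getter that throws KeyNotFoundException for every "not found" case. IndexedMap, its helper methods and the fixed 16 buckets are assumptions made to keep the example short:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; the fixed bucket count is for brevity only.
public class IndexedMap<TKey, TValue>
{
    private readonly List<KeyValuePair<TKey, TValue>>[] buckets =
        new List<KeyValuePair<TKey, TValue>>[16];

    private List<KeyValuePair<TKey, TValue>> ChainFor(TKey key, bool create)
    {
        int index = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        if (buckets[index] == null && create)
            buckets[index] = new List<KeyValuePair<TKey, TValue>>();
        return buckets[index];
    }

    // Position of the key within its chain, or -1 if absent.
    private static int PositionOf(List<KeyValuePair<TKey, TValue>> chain, TKey key)
    {
        if (chain == null) return -1;
        return chain.FindIndex(p => EqualityComparer<TKey>.Default.Equals(p.Key, key));
    }

    // Add refuses to overwrite an existing key.
    public void Add(TKey key, TValue value)
    {
        var chain = ChainFor(key, true);
        if (PositionOf(chain, key) >= 0)
            throw new ArgumentException(string.Format("Key '{0}' already exists.", key));
        chain.Add(new KeyValuePair<TKey, TValue>(key, value));
    }

    public TValue this[TKey key]
    {
        get
        {
            // A missing bucket and a bucket without the key look the same
            // to the caller, so both raise the same exception.
            var chain = ChainFor(key, false);
            int position = PositionOf(chain, key);
            if (position < 0)
                throw new KeyNotFoundException(string.Format("Key '{0}' was not found.", key));
            return chain[position].Value;
        }
        set
        {
            // Assigning through the indexer replaces or inserts.
            var chain = ChainFor(key, true);
            int position = PositionOf(chain, key);
            var pair = new KeyValuePair<TKey, TValue>(key, value);
            if (position >= 0) chain[position] = pair; else chain.Add(pair);
        }
    }
}
```

This mirrors how Dictionary<TKey, TValue> behaves: Add throws on a duplicate key, while assignment through the indexer overwrites.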

Using a List is rather heavy in this case. Generally I'd frown on anyone saying a BCL collection was too heavy, but when it comes to rolling your own collections, it's generally either because (1) you want to learn from the exercise or (2) the BCL collections don't suit your purposes. In case (1) you should learn how to complete the job you started, and in case (2) you need to be sure that List doesn't have whatever failing you found with Dictionary.
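One way to "complete the job" without List is to chain entries through hand-rolled nodes. This is only a sketch under assumptions: NodeChainMap, its nested Node class, the Contains method and the fixed 16 buckets are hypothetical names and choices, not part of the code under review:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: each bucket is the head of a singly linked chain,
// so no List<T> is allocated per bucket.
public class NodeChainMap<TKey, TValue>
{
    private sealed class Node
    {
        public readonly TKey Key;
        public TValue Value;
        public Node Next;

        public Node(TKey key, TValue value, Node next)
        {
            Key = key;
            Value = value;
            Next = next;
        }
    }

    private readonly Node[] buckets = new Node[16]; // fixed size for brevity

    private int IndexFor(TKey key)
    {
        return (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
    }

    public void Add(TKey key, TValue value)
    {
        int index = IndexFor(key);
        for (Node node = buckets[index]; node != null; node = node.Next)
        {
            if (EqualityComparer<TKey>.Default.Equals(node.Key, key))
                throw new ArgumentException("An item with the same key already exists.");
        }
        // Push the new node onto the front of the chain.
        buckets[index] = new Node(key, value, buckets[index]);
    }

    public bool Contains(TKey key)
    {
        for (Node node = buckets[IndexFor(key)]; node != null; node = node.Next)
        {
            if (EqualityComparer<TKey>.Default.Equals(node.Key, key)) return true;
        }
        return false;
    }
}
```

An empty bucket costs a single null reference here, and a one-item chain costs one node rather than a List plus its backing array.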

Your removal throws an error that is both nonsensical to someone who doesn't know about the implementation details and inconsistent (whether something else existed in that bucket is not something they should care about). Since removing a non-existent item isn't harmful, it is more common to merely return a bool indicating whether the item had been present or not, and let the user decide whether that indicates an error. It is also wasteful in continuing to search the entire bucket after the item has been removed.

Your implementation does not allow null keys, which is reasonable enough (indeed, the documentation for IDictionary<TKey, TValue> says that implementations may or may not do so). However, the way you reject them is by letting the NullReferenceException caused by calling GetHashCode() on null propagate, rather than checking and throwing an ArgumentNullException. For the user to receive a NullReferenceException suggests that the collection itself was null. This is hence a clear bug.
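The removal and null-key points might look like this in code. RemovableMap is a hypothetical name, the table is fixed at 16 buckets, and Add skips the duplicate check purely to keep the sketch short:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of Remove returning bool and an explicit null check.
public class RemovableMap<TKey, TValue>
{
    private readonly List<KeyValuePair<TKey, TValue>>[] buckets =
        new List<KeyValuePair<TKey, TValue>>[16];

    private int IndexFor(TKey key)
    {
        // Reject null explicitly before GetHashCode() can raise a
        // NullReferenceException.
        if (key == null) throw new ArgumentNullException("key");
        return (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
    }

    // Duplicate check omitted for brevity.
    public void Add(TKey key, TValue value)
    {
        int index = IndexFor(key);
        if (buckets[index] == null)
            buckets[index] = new List<KeyValuePair<TKey, TValue>>();
        buckets[index].Add(new KeyValuePair<TKey, TValue>(key, value));
    }

    // Returns true if the key was present and removed, false otherwise.
    public bool Remove(TKey key)
    {
        var chain = buckets[IndexFor(key)];
        if (chain == null) return false;
        for (int i = 0; i < chain.Count; i++)
        {
            if (EqualityComparer<TKey>.Default.Equals(chain[i].Key, key))
            {
                chain.RemoveAt(i); // stop at the first match, no second scan
                return true;
            }
        }
        return false;
    }
}
```

Removing by position avoids the second pass that a Find followed by List.Remove would make over the same chain.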

jgauffin · Answer

A Remove method should never throw an exception. You are trying to remove an item. No harm is done if it has already been removed. All collection classes in .NET use bool as a return value to indicate whether an item was really removed.
Do not throw Exception; throw a specific one. Browse through all exceptions in the Collections namespaces to find suitable ones.
Add a TryGetValue.
Use KeyValuePair, which is already a part of .NET, instead of creating your own.
Add a constructor which can define the map size.
When throwing exceptions, include details about why they were thrown. For instance, instead of writing "This key exists", write string.Format("Key '{0}' already exists", key).
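Several of these suggestions can be combined in one small sketch. SizedMap, the default of 16 buckets and the parameter name bucketCount are assumptions for illustration; KeyValuePair, ArgumentException and ArgumentOutOfRangeException are the standard .NET types:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: configurable size, TryGetValue, KeyValuePair,
// and specific exceptions with detailed messages.
public class SizedMap<TKey, TValue>
{
    private readonly List<KeyValuePair<TKey, TValue>>[] buckets;

    public SizedMap() : this(16) { }

    public SizedMap(int bucketCount)
    {
        if (bucketCount <= 0)
            throw new ArgumentOutOfRangeException("bucketCount", "Bucket count must be positive.");
        buckets = new List<KeyValuePair<TKey, TValue>>[bucketCount];
    }

    private int IndexFor(TKey key)
    {
        return (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
    }

    public void Add(TKey key, TValue value)
    {
        TValue existing;
        if (TryGetValue(key, out existing))
            throw new ArgumentException(string.Format("Key '{0}' already exists.", key));
        int index = IndexFor(key);
        if (buckets[index] == null)
            buckets[index] = new List<KeyValuePair<TKey, TValue>>();
        buckets[index].Add(new KeyValuePair<TKey, TValue>(key, value));
    }

    // Reports a missing key through the return value instead of throwing.
    public bool TryGetValue(TKey key, out TValue value)
    {
        var chain = buckets[IndexFor(key)];
        if (chain != null)
        {
            foreach (var pair in chain)
            {
                if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
                {
                    value = pair.Value;
                    return true;
                }
            }
        }
        value = default(TValue);
        return false;
    }
}
```

Callers that expect misses can use TryGetValue and avoid wrapping lookups in try/catch, as the test code in the question has to do.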

Tags: c#, hashmap, data-structures

j03m
