I have a Dictionary<string,int> that could contain upwards of 10 million unique keys. I am trying to reduce the amount of memory this takes, while still maintaining the functionality of the dictionary.
I had the idea of storing a hash of the string as a long instead; this decreases the app's memory usage to an acceptable amount (~1.5 GB down to ~0.5 GB), but I don't feel very good about my method for doing this.
long longKey = BitConverter.ToInt64(cryptoTransformSHA1.ComputeHash(enc.GetBytes(strKey)), 0);
Basically this chops off the end of a SHA1 hash and keeps only the first 8 bytes, which I then use as a long key. While this works, at least for the data I'm testing with, I don't feel it is a very reliable solution because of the increased possibility of key collisions.
Are there any other ways of reducing the Dictionary's memory footprint, or is the method I have above not as horrible as I think it is?
[edit]
To clarify, I need to maintain the ability to look up a value in the Dictionary using a string. Storing the actual string as the key takes far too much memory. What I would like to do instead is use a Dictionary<long,int>, where the long is the result of a hashing function applied to the string.
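For reference, this is roughly what I mean (just a sketch; the class and member names below are illustrative, not my actual code):

    // Hash the string once, keep only the 64-bit key in the Dictionary<long,int>.
    using System;
    using System.Collections.Generic;
    using System.Security.Cryptography;
    using System.Text;

    class HashedLookup
    {
        private readonly SHA1 sha1 = SHA1.Create();
        private readonly Dictionary<long, int> map = new Dictionary<long, int>();

        private long KeyFor(string s)
        {
            // First 8 bytes of the 20-byte SHA-1 digest, reinterpreted as a long
            byte[] digest = sha1.ComputeHash(Encoding.UTF8.GetBytes(s));
            return BitConverter.ToInt64(digest, 0);
        }

        public void Add(string s, int value) { map[KeyFor(s)] = value; }
        public bool TryGet(string s, out int value) { return map.TryGetValue(KeyFor(s), out value); }
    }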
I have done something similar recently, and for a set of reasons fairly unique to my application I did not use a database; in fact, I was trying to stop using a database. I have found that GetHashCode is significantly improved in .NET 3.5. One important note: NEVER persist the results of GetHashCode. NEVER EVER. They are not guaranteed to be consistent between versions of the framework.
You really need to analyze your data, since different hash functions may work better or worse on it. You also need to account for speed. As a general rule, cryptographic hash functions should not produce many collisions, even as the number of hashes moves into the billions. For things that I need to be unique I typically use SHA1Managed. In general the CryptoAPI has terrible performance, even when the underlying hash functions perform well.
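To illustrate the SHA1Managed point (this is a sketch, not my actual code):

    // SHA1Managed runs entirely in managed code and avoids the CryptoAPI interop
    // cost of SHA1CryptoServiceProvider; both produce the same 20-byte digest.
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        byte[] digest = sha1.ComputeHash(Encoding.UTF8.GetBytes(strKey));
        long longKey = BitConverter.ToInt64(digest, 0);
    }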
For a 64-bit hash I currently use Lookup3 and FNV-1 together, which are both 32-bit hashes. For a collision to occur, both would have to collide on the same pair of inputs, which is mathematically improbable and which I have not seen happen over roughly 100 million hashes. Code for both is publicly available on the web.
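Roughly, the combination looks like this (sketch only; the FNV-1 constants are the standard published ones, and the Lookup3 call is a placeholder for whichever port you pick up, since that implementation is too long to reproduce here):

    // Combine two independent 32-bit hashes into one 64-bit key.
    static long Combine64(string key)
    {
        uint h1 = Fnv1_32(key);
        uint h2 = Lookup3.Hash(key); // placeholder for a Lookup3 (Bob Jenkins) port
        return ((long)h1 << 32) | h2;
    }

    // FNV-1, 32-bit: multiply by the FNV prime, then XOR in each byte.
    static uint Fnv1_32(string s)
    {
        const uint offsetBasis = 2166136261;
        const uint fnvPrime = 16777619;
        uint hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(s))
        {
            hash = unchecked(hash * fnvPrime); // FNV-1 multiplies first...
            hash ^= b;                         // ...then XORs (FNV-1a reverses these two steps)
        }
        return hash;
    }

The point of packing the two results into separate halves of the long is that a collision then requires both underlying functions to collide on the same pair of strings.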
Still, conduct your own analysis. What has worked for me may not work for you. Within my own office, different applications with different requirements use different hash functions or combinations of hash functions.
I would avoid any unproven hash functions. There are as many hash functions as there are people who think they should be writing them. Do your research and test, test, test.