I've been doing some performance analysis on the software I develop, and I've found that lookups in a global dictionary of URLs take about 10% of the application's "load" phase time. The dictionary is implemented as a C++ STL std::map, which has O(lg n) lookups. I'm going to move it to a hash_map, which has roughly constant-time lookups. The STL string class doesn't have a hash code property, and it certainly doesn't cache a hash code. That means each lookup requires regenerating the hash code.
I'm skeptical that caching the hash code is worth the effort. It would mean changing many lines of code to use a new string class with a cached hash code property. Given that the current implementation does log(n) full string comparisons on every lookup, I think reducing it to basically one string traversal (by the hash function) per lookup is a big win.
Does anyone have experience with caching string hash codes? Has it ever proven worth the effort?
One of the most common applications of hashing strings is comparing them. Comparing two strings of length N takes O(N) time, but comparing two integers takes O(1). Hence, comparing the hashes of two strings takes O(1) time.
If two string objects are equal, the GetHashCode method returns identical values. However, there is not a unique hash code value for each unique string value: different strings can return the same hash code, and the hash code itself is not guaranteed to be stable across processes or versions.
The complexity of a hash function over a string is never O(1): if the length of the string is n, computing its hash is O(n). However, if you compute all the hashes in a given array up front, you never have to compute them a second time, and you can compare two strings in O(1) by comparing the precalculated hashes (a hash match still needs one full compare to rule out collisions).
I don't have experience with caching hash codes, but I've done some work recently converting std::map to std::tr1::unordered_map. Two thoughts come to mind. First, try profiling that relatively simple change by itself, because it sometimes makes things worse, depending on what your code is doing; it might give you enough of a speedup on its own before you try optimizing further. Second, what does your profiler say about the other 90% of your initialization time? Even if you optimized the global dictionary down to zero time, you would at most improve performance by 10%.
One word of warning: while a hash map can have constant-time lookups, it can also degrade to O(N) lookups. It's not a common case, but it does happen. With a map, on the other hand, you always pay the O(log N) cost, but you are also guaranteed it will never be worse.
When you compare the hash map to the map, also try a Trie, or related data structure (whatever you can get off the shelf):
Trie implementation
Unfortunately you may then spend a lot of time worrying about cache-friendliness. In that respect a Trie is similar to the tree you already have, and a hash map will probably be better-behaved than a naively-allocated tree.
Also, I'm a little confused by the question. If you're looking up the same string object multiple times, such that caching its hash value is worthwhile, shouldn't you just be caching the result of the lookup? The whole point of a hash table is that different objects which are value-equal hash to the same value. If you aren't computing the same hash several times from distinct strings containing the same characters, then your hash table probably isn't doing its job.
If you mean caching the values of the keys already in the hash table, that's up to the hash table.