Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Should I cache the hash code of an STL string used as a hash key?

I've doing some performance analysis on the software I develop, and I've found that lookups on a global dictionary of URL's takes about 10% of the application's "load" phase time. The dictionary is implemented as a C++ STL std::map, which has O(lg n) lookups. I'm going to move it to a hash_map, which has roughly fixed time lookups. The stl string class doesn't have a hash code property, and it certainly doesn't cache a hash code. That means that each lookup requires re-generating the hash code.

I'm skeptical that caching the hash code is worth the effort. It would mean changing many lines of code to use a new string class with a cached hash code property. Given that the current implementation does log(n) full string comparisons on every lookup, I think reducing it to basically one string traversal (by the hash function) per lookup is a big win.

Does anyone have experience with caching string hash codes? Has it ever proven worth the effort?

like image 540
David Gladfelter Avatar asked Feb 03 '10 19:02

David Gladfelter


People also ask

What is the time complexity of hashing a string?

One of the most common applications of hashing strings is to compare it. Comparing strings of length N takes O(N) time complexity but comparing integers take O(1) time complexity. Hence, comparing hash of strings take O(1) time complexity.

Is string HashCode unique?

If two string objects are equal, the GetHashCode method returns identical values. However, there is not a unique hash code value for each unique string value. Different strings can return the same hash code. The hash code itself is not guaranteed to be stable.

What will be the time complexity to find out the hash value for all the strings?

The complexity of a hashing function is never O(1). If the length of the string is n then the complexity is surely O(n). However, if you compute all hashes in a given array, you won't have to calculate for the second time and you can always compare two strings in O(1) time by comparing the precalculated hashes.


3 Answers

I don't have experience with caching hash codes, but I've done some work recently converting std::map to std::tr1::unordered_map. Two thoughts come to mind. First, try profiling that relatively simple change first, because it sometimes makes things worse, depending on what your code is doing. It might give you enough speedup on its own before you try optimizing further. Secondly, what does your profiler say about the other 90% of your initialization time? Even if you optimized the global dictionary stuff down to 0 time, you will at most improve performance by 10%.

like image 187
Michael Kristofik Avatar answered Sep 24 '22 08:09

Michael Kristofik


One word of warning.

While a hash map can have fixed time lookups, it also can end up having O(N) lookups. While it's not a common case, it does happen.

So while you always have to pay for the O(log N) time in a map, you are also guaranteed that it will not be worse.

like image 22
R Samuel Klatchko Avatar answered Sep 24 '22 08:09

R Samuel Klatchko


When you compare the hash map to the map, also try a Trie, or related data structure (whatever you can get off the shelf):

Trie implementation

Unfortunately you may then spend a lot of time worrying about cache-friendliness. In that respect a Trie is similar to the tree you already have, and a hash map will probably be better-behaved than a naively-allocated tree.

Also, I'm a little confused by the question. If you're looking up the same string object multiple times, such that caching its hash value is worthwhile, shouldn't you just be caching the result of the lookup? The whole point of a hash table is that different objects which are value-equal hash to the same value. If you aren't computing the same hash several times from distinct strings containing the same characters, then your hash table probably isn't doing its job.

If you mean caching the values of the keys already in the hash table, that's up to the hash table.

like image 41
Steve Jessop Avatar answered Sep 24 '22 08:09

Steve Jessop