We are developing highly performance-critical software in C++, in which we need a concurrent hash map, so we implemented one. We then wrote a benchmark to figure out how much slower our concurrent hash map is compared with std::unordered_map.

But std::unordered_map seems to be incredibly slow... This is our micro-benchmark (for the concurrent map we spawn a new thread to make sure that the locking does not get optimized away, and note that I never insert 0 because I also benchmark with google::dense_hash_map, which needs a designated empty key):
    boost::random::mt19937 rng;
    boost::random::uniform_int_distribution<> dist(std::numeric_limits<uint64_t>::min(), std::numeric_limits<uint64_t>::max());
    std::vector<uint64_t> vec(SIZE);
    for (int i = 0; i < SIZE; ++i) {
        uint64_t val = 0;
        while (val == 0) {
            val = dist(rng);
        }
        vec[i] = val;
    }

    std::unordered_map<int, long double> map;
    auto begin = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < SIZE; ++i) {
        map[vec[i]] = 0.0;
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin);
    std::cout << "inserts: " << elapsed.count() << std::endl;

    std::random_shuffle(vec.begin(), vec.end());
    begin = std::chrono::high_resolution_clock::now();
    long double val;
    for (int i = 0; i < SIZE; ++i) {
        val = map[vec[i]];
    }
    end = std::chrono::high_resolution_clock::now();
    elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin);
    std::cout << "get: " << elapsed.count() << std::endl;
(EDIT: the whole source code can be found here: http://pastebin.com/vPqf7eya)
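The actual harness is in the pastebin; as a rough illustration only, running the timed loop in a separately spawned thread looks something like the sketch below (the mutex-protected std::unordered_map here is just a stand-in for our concurrent map, not its real interface):

    // Minimal sketch (not the pastebin code): a mutex-protected std::unordered_map
    // stands in for the hand-rolled concurrent map, and the insert loop runs on a
    // separately spawned std::thread so the locking cannot be optimized away.
    #include <cstdint>
    #include <mutex>
    #include <thread>
    #include <unordered_map>
    #include <vector>

    void run_insert_benchmark(const std::vector<uint64_t>& keys)
    {
        std::unordered_map<uint64_t, long double> map;  // stand-in for the concurrent map
        std::mutex mtx;

        std::thread worker([&] {
            for (uint64_t key : keys) {
                std::lock_guard<std::mutex> guard(mtx);  // per-operation locking
                map[key] = 0.0;
            }
        });
        worker.join();  // the timing would bracket this call
    }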
The result for std::unordered_map is:
    inserts: 35126
    get    : 2959
For google::dense_hash_map:
    inserts: 3653
    get    : 816
For our hand-rolled concurrent map (which does locking, even though the benchmark is single-threaded, but in a separately spawned thread):
    inserts: 5213
    get    : 2594
If I compile the benchmark program without pthread support and run everything in the main thread, I get the following results for our hand-rolled concurrent map:
    inserts: 4441
    get    : 1180
I compile with the following command:
    g++-4.7 -O3 -DNDEBUG -I/tmp/benchmap/sparsehash-2.0.2/src/ -std=c++11 -pthread main.cc
So especially inserts into std::unordered_map seem to be extremely expensive: 35 seconds vs. 3-5 seconds for the other maps. The lookup time also seems to be quite high.
My question: why is this? I read another question on Stack Overflow where someone asks why std::tr1::unordered_map is slower than his own implementation. There the highest-rated answer states that std::tr1::unordered_map has to implement a more complicated interface. But I don't see how that argument applies here: we use a bucket approach in our concurrent_map, and std::unordered_map uses a bucket approach too (google::dense_hash_map does not, but then std::unordered_map should be at least as fast as our hand-rolled, concurrency-safe version). Apart from that, I cannot see anything in the interface that forces a feature which would make the hash map perform badly...
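As background on that comparison: google::dense_hash_map uses open addressing in a single flat array (which is why it needs a designated empty key), while a chaining table allocates a node per element and chases pointers between nodes. The sketch below is a generic linear-probing lookup, purely illustrative and not dense_hash_map's actual code:

    // Schematic open-addressing lookup (linear probing) over one contiguous array.
    // No per-element node allocation and no pointer chasing between nodes, unlike
    // a chained (bucket-list) table. Assumes the table always keeps at least one
    // empty slot so the probe loop terminates.
    #include <cstdint>
    #include <vector>

    struct Slot {
        uint64_t key;       // 0 is reserved as "empty" (hence: never insert 0)
        long double value;
    };

    long double* find(std::vector<Slot>& table, uint64_t key)
    {
        const std::size_t mask = table.size() - 1;       // table size is a power of two
        for (std::size_t i = key & mask; ; i = (i + 1) & mask) {
            if (table[i].key == key) return &table[i].value;
            if (table[i].key == 0)   return nullptr;     // hit an empty slot: not present
        }
    }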
So my question: is it true that std::unordered_map is just very slow? If not: what am I doing wrong? If so: what is the reason for that?
And my main question: why is inserting a value into a std::unordered_map so terribly expensive (even if we reserve enough space at the beginning, it does not perform much better, so rehashing does not seem to be the problem)?
First of all: yes, the presented benchmark is not flawless. That is because we played around with it a lot and it is just a hack (for example, using a uint64 distribution to generate ints would not be a good idea in practice, excluding 0 in a loop is kind of stupid, etc.).
At the moment most comments explain that I can make the unordered_map faster by preallocating enough space for it. In our application this is just not possible: we are developing a database management system and need a hash map to store some data during a transaction (for example, locking information). So this map can hold anything from 1 entry (the user just makes one insert and commits) to billions of entries (if full table scans happen). It is simply impossible to preallocate enough space here (and just allocating a lot at the beginning would consume too much memory).
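For completeness, "preallocating enough space" in the comments amounts to something like the following (SIZE is just a placeholder for the expected element count); it takes rehashing out of the measurement but, as explained above, it is not an option when the final size is unknown:

    #include <cstdint>
    #include <unordered_map>

    constexpr std::size_t SIZE = 10000000;   // placeholder for the expected element count

    std::unordered_map<uint64_t, long double> make_presized_map()
    {
        std::unordered_map<uint64_t, long double> map;
        map.max_load_factor(1.0f);   // 1.0 is already the default
        map.reserve(SIZE);           // buckets for at least SIZE elements, so no rehash during insert
        return map;
    }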
Furthermore, I apologize that I did not state my question clearly enough: I am not really interested in making unordered_map fast (using Google's dense hash map works fine for us), I just don't really understand where these huge performance differences come from. It cannot be just preallocation (even with enough preallocated memory, the dense map is an order of magnitude faster than unordered_map, and our hand-rolled concurrent map starts with an array of size 64, so a smaller one than unordered_map).
So what is the reason for this bad performance of std::unordered_map? Or, asked differently: could one write an implementation of the std::unordered_map interface that is standard-conforming and (nearly) as fast as Google's dense hash map? Or is there something in the standard that forces the implementer to choose an inefficient way to implement it?
By profiling I see that a lot of time is spent on integer divisions. std::unordered_map uses prime numbers for the array size, while the other implementations use powers of two. Why does std::unordered_map use prime numbers? To perform better if the hash is bad? For good hashes it makes, IMHO, no difference.
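To make the profiling observation concrete: reducing a hash value to a bucket index in a prime-sized table costs an integer division/modulo, while a power-of-two-sized table only needs a bitwise AND. A minimal sketch of the two strategies (not taken from any particular implementation):

    #include <cstdint>

    // Prime-sized table (as in the libstdc++ chaining table described above):
    // one hardware integer division/modulo per lookup or insert.
    std::size_t bucket_prime(std::uint64_t hash, std::size_t prime_bucket_count)
    {
        return hash % prime_bucket_count;        // integer division
    }

    // Power-of-two-sized table (dense_hash_map-style): a single AND with a mask,
    // but weak hash functions then need extra mixing because only the low bits count.
    std::size_t bucket_pow2(std::uint64_t hash, std::size_t pow2_bucket_count)
    {
        return hash & (pow2_bucket_count - 1);   // cheap bit mask
    }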
These are the numbers for std::map:
    inserts: 16462
    get    : 16978
Sooooooo: why are inserts into a std::map faster than inserts into a std::unordered_map... I mean, WAT? std::map has worse locality (tree vs. array), needs to make more allocations (one per insert vs. one per rehash plus ~1 for each collision) and, most importantly, has a different algorithmic complexity (O(log n) vs. O(1))!
std::unordered_map is supposedly slow because it has fairly stringent iterator-invalidation requirements. In my experience, unless you need to wring every last bit of performance out of your code, it's not a huge issue; it's generally faster than most casual implementations.
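On the invalidation point: the standard requires that rehashing invalidates iterators but not pointers or references to elements, which in practice pushes implementations toward a node-per-element layout. A small sketch of what that guarantee means:

    #include <cassert>
    #include <unordered_map>

    int main()
    {
        std::unordered_map<int, int> m;
        m[42] = 1;
        int* p = &m[42];              // pointer to the mapped value

        for (int i = 1000; i < 200000; ++i)
            m[i] = i;                 // forces several rehashes

        assert(p == &m[42]);          // pointers/references stay valid across rehashing
        assert(*p == 1);
    }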
I found the reason: it is a problem with gcc-4.7!!
With gcc-4.7:

    inserts: 37728
    get    : 2985
With gcc-4.6:

    inserts: 2531
    get    : 1565
So std::unordered_map in gcc-4.7 is broken (or my installation is, which is gcc-4.7.0 on Ubuntu; another installation showing the same problem is gcc-4.7.1 on Debian testing).
I will submit a bug report. Until then: DO NOT use std::unordered_map with gcc-4.7!
I am guessing that you have not properly sized your unordered_map, as Ylisar suggested. When chains grow too long in unordered_map, the g++ implementation will automatically rehash to a larger hash table, and this is a big drag on performance. If I remember correctly, unordered_map defaults to (the smallest prime larger than) 100.
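One way to watch this rehashing happen (the default bucket count and the growth policy are implementation details of the library, so the exact numbers printed are only illustrative):

    #include <cstddef>
    #include <iostream>
    #include <unordered_map>

    int main()
    {
        std::unordered_map<int, long double> m;
        std::size_t buckets = m.bucket_count();
        std::cout << "initial buckets: " << buckets << "\n";

        for (int i = 0; i < 1000000; ++i) {
            m[i] = 0.0;
            if (m.bucket_count() != buckets) {          // a rehash happened
                buckets = m.bucket_count();
                std::cout << "rehash at size " << m.size()
                          << " -> " << buckets << " buckets\n";
            }
        }
    }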
I didn't have chrono on my system, so I timed with times().
    #include <iostream>
    #include <sys/times.h>   // times(), struct tms
    #include <unistd.h>      // sysconf(), _SC_CLK_TCK

    template <typename TEST>
    void time_test (TEST t, const char *m)
    {
        struct tms start;
        struct tms finish;
        long ticks_per_second;

        times(&start);
        t();
        times(&finish);
        ticks_per_second = sysconf(_SC_CLK_TCK);
        std::cout << "elapsed: "
                  << ((finish.tms_utime - start.tms_utime
                       + finish.tms_stime - start.tms_stime)
                      / (1.0 * ticks_per_second))
                  << " " << m << std::endl;
    }
I used a SIZE of 10000000, and had to change things a bit for my version of boost. Also note, I pre-sized the hash table to match SIZE/DEPTH, where DEPTH is an estimate of the length of the bucket chain due to hash collisions.
Edit: Howard points out to me in comments that the max load factor for unordered_map is 1. So, the DEPTH controls how many times the code will rehash.
    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <limits>
    #include <unordered_map>
    #include <vector>
    #include <boost/random.hpp>

    #define SIZE 10000000
    #define DEPTH 3

    std::vector<uint64_t> vec(SIZE);
    boost::mt19937 rng;
    boost::uniform_int<uint64_t> dist(std::numeric_limits<uint64_t>::min(), std::numeric_limits<uint64_t>::max());
    std::unordered_map<int, long double> map(SIZE/DEPTH);

    void test_insert () {
        for (int i = 0; i < SIZE; ++i) {
            map[vec[i]] = 0.0;
        }
    }

    void test_get () {
        long double val;
        for (int i = 0; i < SIZE; ++i) {
            val = map[vec[i]];
        }
    }

    int main () {
        for (int i = 0; i < SIZE; ++i) {
            uint64_t val = 0;
            while (val == 0) {
                val = dist(rng);
            }
            vec[i] = val;
        }
        time_test(test_insert, "inserts");
        std::random_shuffle(vec.begin(), vec.end());
        time_test(test_get, "get");
    }
Edit: I modified the code so that I could change out DEPTH more easily.

    #ifndef DEPTH
    #define DEPTH 10000000
    #endif
So, by default, the worst size for the hash table is chosen.
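The numbers below come from recompiling with different -DDEPTH values, along these lines (the exact flags are my guess, not stated above):

    g++ -O3 -std=c++11 -DDEPTH=10000 main.cc
    ./a.out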
    elapsed: 7.12 inserts, elapsed: 2.32 get, -DDEPTH=10000000
    elapsed: 6.99 inserts, elapsed: 2.58 get, -DDEPTH=1000000
    elapsed: 8.94 inserts, elapsed: 2.18 get, -DDEPTH=100000
    elapsed: 5.23 inserts, elapsed: 2.41 get, -DDEPTH=10000
    elapsed: 5.35 inserts, elapsed: 2.55 get, -DDEPTH=1000
    elapsed: 6.29 inserts, elapsed: 2.05 get, -DDEPTH=100
    elapsed: 6.76 inserts, elapsed: 2.03 get, -DDEPTH=10
    elapsed: 2.86 inserts, elapsed: 2.29 get, -DDEPTH=1
My conclusion is that there is not much of a performance difference for any initial hash table size, other than making it equal to the entire expected number of unique insertions. Also, I don't see the order-of-magnitude performance difference that you are observing.