I have a situation where I need to optimize the creation of an unordered set. Expected number of elements is around 5-25M. My first thought is that I should prepare all data beforehand and do something like
unordered_set s(data);
instead of
for (auto& elem : data)
s.insert(elem);
Can the standard unordered_set use bulk-loading methods to speed up its creation? How can I tweak the hash table's parameters (bucket count, etc.) if I know the expected number of elements prior to construction?
This question is quite broad and interesting.
First of all, there is a special method called reserve - it allows you to pre-allocate storage for a number of elements before actually inserting them. Pre-allocating sufficient memory (and avoiding rehashes during the insertion) is a very powerful approach, commonly used for large data sets. Note that it is also available for various standard containers, including vector, unordered_map, etc.
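A minimal sketch of that idea (assuming the elements are std::string and have already been collected into a std::vector named data; both are just placeholders for this example):

#include <string>
#include <unordered_set>
#include <vector>

void build_set(const std::vector<std::string>& data) {
    std::unordered_set<std::string> s;
    // Reserve enough buckets for all elements up front,
    // so no rehashing happens during the insertion loop.
    s.reserve(data.size());
    for (const auto& elem : data)
        s.insert(elem);
}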
Secondly, if you're using C++11, you may benefit from using move semantics when inserting the elements into your container (provided, of course, that you don't need them elsewhere once they are placed in the set, which should be true for 5 to 25 million objects).
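A possible sketch of that, again assuming a std::vector<std::string> named data that is not needed afterwards:

#include <iterator>
#include <string>
#include <unordered_set>
#include <vector>

std::unordered_set<std::string> build_set(std::vector<std::string>&& data) {
    std::unordered_set<std::string> s;
    s.reserve(data.size());
    // make_move_iterator turns each dereference into an rvalue,
    // so the strings are moved into the set rather than copied.
    s.insert(std::make_move_iterator(data.begin()),
             std::make_move_iterator(data.end()));
    return s;
}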
These two techniques are a good start. You may need to tune things further by choosing a different hash function, or even a different implementation of an unordered_set. But at this point you should provide more information: what are your value objects, what is their life cycle, and what insertion time is acceptable in your application?
EDIT: of course it's all about C++11, as unordered_set was not available prior to it. Shame on me :)
My focus now is on whether I can use functions like rehash to notify the table of the upcoming size
Suppose you call
unordered_set s(begin(data), end(data));
While the standard doesn't dictate an implementation, a good implementation will be able to discern the number of elements and preallocate accordingly. If you look at the source code used by gcc (on my machine, /usr/include/c++/5/tr1/hashtable.h), for example, it uses
), for example, it uses
_M_bucket_count = std::max(_M_rehash_policy._M_next_bkt(__bucket_hint),
                           _M_rehash_policy._M_bkt_for_elements(
                               __detail::__distance_fw(__f, __l)));
_M_buckets = _M_allocate_buckets(_M_bucket_count);
so it already preallocates the bucket array based on the number of elements.
The problem might be different, though. If you look at the documentation, it states:
constructs the container with the contents of the range [first, last). Sets max_load_factor() to 1.0.
This saves space, but might cause collisions. To reduce collisions, you could use
unordered_set s(begin(data), end(data), k * data.size());
where k > 1 is some constant. This corresponds to a load factor of about 1 / k. YMMV.
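For illustration only (the element type std::string and k = 2 below are assumptions made for this sketch):

#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

int main() {
    std::vector<std::string> data = {"a", "b", "c", "d"};
    const std::size_t k = 2;  // assumed over-allocation factor

    // Pass an explicit bucket count hint so the resulting
    // load factor ends up around 1 / k instead of 1.0.
    std::unordered_set<std::string> s(data.begin(), data.end(),
                                      k * data.size());

    std::cout << "buckets: " << s.bucket_count()
              << ", load factor: " << s.load_factor() << '\n';
}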