 

Build large(ish) unordered set with all data available at the beginning

I have a situation where I need to optimize the creation of an unordered set. Expected number of elements is around 5-25M. My first thought is that I should prepare all data beforehand and do something like

unordered_set s(begin(data), end(data));

instead of

for (auto& elem : data)
    s.insert(elem); 

Can the STL unordered set use bulk-loading methods to speed up its creation? How can I tweak the parameters of the hash table (bucket count, etc.) if I know the expected number of elements prior to construction?

Asked Feb 05 '23 by Lorah Attkins

2 Answers

This question is quite broad and interesting.

First of all, there is a special method called reserve - it allows you to pre-allocate storage for a number of elements before actually inserting them. Pre-allocating sufficient memory (and avoiding rehashes during insertion) is a very powerful approach, which is commonly used for large data sets. Note that it is also available for various standard containers, including vector, unordered_map, etc.
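For instance, a minimal sketch of the reserve approach (assuming the elements are strings sitting in a vector called data, which stands in for whatever your actual source is):

#include <string>
#include <unordered_set>
#include <vector>

int main() {
    std::vector<std::string> data; // stand-in for your 5-25M prepared elements

    std::unordered_set<std::string> s;
    s.reserve(data.size()); // allocate buckets for data.size() elements up
                            // front, so no rehash occurs during insertion
    for (const auto& elem : data)
        s.insert(elem);
}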

Secondly, if you're using C++11, you might benefit from move semantics while inserting the elements into your container (of course, provided you don't need them in your source once they are placed in the set, which should be true for 5 to 25 million objects).
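A sketch of move-based insertion, again assuming a hypothetical vector of strings (the strings in data are left in a valid but unspecified state afterwards):

#include <iterator>
#include <string>
#include <unordered_set>
#include <vector>

int main() {
    std::vector<std::string> data; // stand-in for your prepared elements

    // make_move_iterator makes each dereference yield an rvalue, so the
    // set moves the strings instead of copying them (C++11 and later).
    std::unordered_set<std::string> s(std::make_move_iterator(data.begin()),
                                      std::make_move_iterator(data.end()));
}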

These two techniques are a good start. You may need to tune things further by choosing a different hash function, or even a different implementation of unordered_set. But at this point you should provide more information: what are your value objects and what is their life cycle, and what insertion time do you find acceptable in your application?
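If you do end up supplying your own hash function, it plugs in as a template parameter. A sketch with an entirely hypothetical Record type and hasher:

#include <cstddef>
#include <unordered_set>

// Hypothetical value type; substitute your real objects.
struct Record {
    long id;
    bool operator==(const Record& other) const { return id == other.id; }
};

// Hypothetical hasher; a real one should mix the bits properly.
struct RecordHash {
    std::size_t operator()(const Record& r) const {
        return static_cast<std::size_t>(r.id);
    }
};

int main() {
    std::unordered_set<Record, RecordHash> s;
    s.reserve(25000000); // combine with the pre-allocation idea above
}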

EDIT: of course it's all about C++11, as unordered_set was not available prior to it. Shame on me :)

Answered Mar 01 '23 by iehrlich


"My focus now is on whether I can use functions like rehash to notify the table of the upcoming size"

Suppose you call

unordered_set s(begin(data), end(data)); 

While the standard doesn't dictate an implementation, a good implementation will discern the number of elements and preallocate buckets accordingly. If you look at the source code used by gcc (on my machine, /usr/include/c++/5/tr1/hashtable.h), for example, it uses

 _M_bucket_count = std::max(_M_rehash_policy._M_next_bkt(__bucket_hint),
                            _M_rehash_policy._M_bkt_for_elements(
                                __detail::__distance_fw(__f, __l)));
 _M_buckets = _M_allocate_buckets(_M_bucket_count);

so it already preallocates the bucket array based on the number of elements.

The problem might be different, though. If you look at the documentation, it states:

constructs the container with the contents of the range [first, last). Sets max_load_factor() to 1.0.

This saves space, but might cause collisions. To reduce collisions, you could use

unordered_set s(begin(data), end(data), k * data.size()); 

where k > 1 is some constant. This corresponds to a load factor of at most about 1 / k. YMMV.
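To see the effect, you can inspect the table after construction; a small sketch with an assumed k of 2:

#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

int main() {
    std::vector<std::string> data; // stand-in for your prepared elements
    const std::size_t k = 2;       // illustrative constant; tune empirically

    std::unordered_set<std::string> s(data.begin(), data.end(),
                                      k * data.size());

    // With a hint of k * data.size() buckets (the implementation may round
    // it up), load_factor() should come out at most around 1.0 / k.
    std::cout << s.load_factor() << " " << s.bucket_count() << "\n";
}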

Answered Mar 01 '23 by Ami Tavory