I've a little bit confusion how unordered_map works and what buckets are and how thet are managed.
From this blog post, unordered_map is vector of vectors.
My questions are:
Sorry for these questions, but I didn't find any detailed explanation how this structure works (on cppreference.com for example).
The C++ function std::unordered_map::find() finds an element associated with key k. If operation succeeds then methods returns iterator pointing to the element otherwise it returns an iterator pointing the map::end().
map is used to store elements as key,value pairs in sorted order. unordered_map is used to store elements as key,value pairs in non-sorted order.
Internally unordered_map is implemented using Hash Table, the key provided to map is hashed into indices of a hash table which is why the performance of data structure depends on the hash function a lot but on average, the cost of search, insert, and delete from the hash table is O(1).
unordered_map is a data structure that is used to store data in the form of pairs of keys and their corresponding values. Unordered_map uses a hashing function to store a key-value pair, due to which the average time complexity for searching a key-value pair becomes O(1).
std::unordered_map is the standard C++ hash table. It used to be called hash_map in STL, but missed the boat when many of STL's interfaces were merged into C++ in 1998, and by 2011, so many libraries had their own hash_map, C++ had to pick another name (I think "unordered" was a great choice; assuming order in a hash table is a common source of bugs).
is it correct to assume that the buckets are the "internal" vectors?
no, it is both incorrect (incompatible with iterator invalidation requirements) and dangerous (under that assumption you may end up subtracting pointers to elements in the same bucket).
In real life, the buckets are linked lists; e.g.
unordered_map
is a unique_ptr to an array of linked lists of __hash_nodesunordered_map
is a pointer to an array of linked lists of _Hash_nodesis it correct to assume that we have to define the equal method on the key type (in addiction to the hash operator) in order to find the key inside the bucket?
Yes, locating the key in a bucket is exactly what the 4th template parameter of std::unordered_map
is for (does not need to call the "equal method on the key type" literally, of course)
what is the external vector (hash table) size by default?
There is no "external vector". The number of buckets for a default-constructed std::unordered_map
is implementation-defined, you can query it with bucket_count.
what is the internal vector size by default?
There is no "internal vector". The size of any given bucket equals the number of elements currently placed in the bucket. You can query it with bucket_size
what happens if the number of elements in one bucket becomes too big?bor in other words, when the rehash happens?
Nothing happens if the number of elements in one bucket becomes too big. But if average number of elements per bucket (aka load_factor) exceeds max_load_factor, rehash happens (e.g. on insert)
This may help you understand the buckets: http://www.cplusplus.com/reference/unordered_map/unordered_map/bucket_count/ http://www.cplusplus.com/reference/unordered_map/unordered_map/max_load_factor/
But generally, yes, the buckets are something like internal vectors. It needs an equality operator (or a predicate) to distinguish keys that had the same hash as you suggest.
The initial number of buckets is possibly 0. It can be set via rehash() or reserve() (they have slightly different semantics.)
http://www.cplusplus.com/reference/unordered_map/unordered_map/rehash/
In an ideal case, each bucket will only have the one item. You can check this by using bucket_size. When the load factor (total items vs. bucket count) gets high, it rehashes automatically.
By default, it'll aim for a 1:1 load factor. If the hash function is good, this may last until max_bucket_count items are inserted.
Keep in mind that the specific implementation of this may vary. Each implementation (e.g. from different platforms or standard libraries) really only needs to have the correct semantics.
If these answers are important to your program, you can query the values as I've described. If you're just trying to wrap your head around it, query them in some test scenarios and it may become more clear.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With