Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Not sure how unordered_map works

I've a little bit confusion how unordered_map works and what buckets are and how thet are managed.

From this blog post, unordered_map is vector of vectors.

My questions are:

  • is it correct to assume that the buckets are the "internal" vectors?
  • since each bucket (vector) can contains multiple elements, given by an hash collision on the hash table (the "external" vector), and since we have to scan this internal vector (in linear time), is it correct to assume that we have to define the equal method on the key type (in addiction to the hash operator) in order to find the key inside the bucket?
  • what is the external vector (hash table) size by default?
  • what is the internal vector size by default?
  • what happens if the number of elements in one bucket becomes too big?bor in other words, when the rehash happens?

Sorry for these questions, but I didn't find any detailed explanation how this structure works (on cppreference.com for example).

like image 217
justHelloWorld Avatar asked Apr 17 '16 05:04

justHelloWorld


People also ask

How does unordered_map find work?

The C++ function std::unordered_map::find() finds an element associated with key k. If operation succeeds then methods returns iterator pointing to the element otherwise it returns an iterator pointing the map::end().

Why do you use unordered_map why not ordered map?

map is used to store elements as key,value pairs in sorted order. unordered_map is used to store elements as key,value pairs in non-sorted order.

How unordered maps work internally?

Internally unordered_map is implemented using Hash Table, the key provided to map is hashed into indices of a hash table which is why the performance of data structure depends on the hash function a lot but on average, the cost of search, insert, and delete from the hash table is O(1).

How does unordered_map store values?

unordered_map is a data structure that is used to store data in the form of pairs of keys and their corresponding values. Unordered_map uses a hashing function to store a key-value pair, due to which the average time complexity for searching a key-value pair becomes O(1).


2 Answers

std::unordered_map is the standard C++ hash table. It used to be called hash_map in STL, but missed the boat when many of STL's interfaces were merged into C++ in 1998, and by 2011, so many libraries had their own hash_map, C++ had to pick another name (I think "unordered" was a great choice; assuming order in a hash table is a common source of bugs).

is it correct to assume that the buckets are the "internal" vectors?

no, it is both incorrect (incompatible with iterator invalidation requirements) and dangerous (under that assumption you may end up subtracting pointers to elements in the same bucket).

In real life, the buckets are linked lists; e.g.

  • LLVM libc++ unordered_map is a unique_ptr to an array of linked lists of __hash_nodes
  • GNU libstdc++ unordered_map is a pointer to an array of linked lists of _Hash_nodes

is it correct to assume that we have to define the equal method on the key type (in addiction to the hash operator) in order to find the key inside the bucket?

Yes, locating the key in a bucket is exactly what the 4th template parameter of std::unordered_map is for (does not need to call the "equal method on the key type" literally, of course)

what is the external vector (hash table) size by default?

There is no "external vector". The number of buckets for a default-constructed std::unordered_map is implementation-defined, you can query it with bucket_count.

what is the internal vector size by default?

There is no "internal vector". The size of any given bucket equals the number of elements currently placed in the bucket. You can query it with bucket_size

what happens if the number of elements in one bucket becomes too big?bor in other words, when the rehash happens?

Nothing happens if the number of elements in one bucket becomes too big. But if average number of elements per bucket (aka load_factor) exceeds max_load_factor, rehash happens (e.g. on insert)

like image 70
Cubbi Avatar answered Nov 09 '22 16:11

Cubbi


This may help you understand the buckets: http://www.cplusplus.com/reference/unordered_map/unordered_map/bucket_count/ http://www.cplusplus.com/reference/unordered_map/unordered_map/max_load_factor/

But generally, yes, the buckets are something like internal vectors. It needs an equality operator (or a predicate) to distinguish keys that had the same hash as you suggest.

The initial number of buckets is possibly 0. It can be set via rehash() or reserve() (they have slightly different semantics.)

http://www.cplusplus.com/reference/unordered_map/unordered_map/rehash/

In an ideal case, each bucket will only have the one item. You can check this by using bucket_size. When the load factor (total items vs. bucket count) gets high, it rehashes automatically.

By default, it'll aim for a 1:1 load factor. If the hash function is good, this may last until max_bucket_count items are inserted.

Keep in mind that the specific implementation of this may vary. Each implementation (e.g. from different platforms or standard libraries) really only needs to have the correct semantics.

If these answers are important to your program, you can query the values as I've described. If you're just trying to wrap your head around it, query them in some test scenarios and it may become more clear.

like image 22
Todd Christensen Avatar answered Nov 09 '22 14:11

Todd Christensen