unordered_map excess calls to hash function

The following code results in unexplained calls to the hash function:

#include <iostream>
#include <string>
#include <tuple>
#include <unordered_map>

using namespace std;

namespace foo {
    using Position = tuple<int, int, int>;
    
    std::ostream& operator<<(std::ostream& out, const Position& pos) noexcept{
        return out << get<0>(pos) << ", " << get<1>(pos) << ", " << get<2>(pos);
    }

    struct hashFunc{
        std::size_t operator()(const Position& pos) const noexcept{
            int res = get<0>(pos) * 17 ^ get<1>(pos) * 11 ^ get<2>(pos);
            cout << "@@@ hash function called for key: " << pos 
                 << ", hash: " << res << endl;
            return res;
        }
    };

    template<typename T>
    void print_buckets(T&& map) {
        auto num_buckets = map.bucket_count();
        cout << "------------------------------" << endl;
        cout << "NUM BUCKETS: " << num_buckets << endl;
        for(size_t i=0; i<num_buckets; ++i) {
            auto bucket_size = map.bucket_size(i);
            if(bucket_size) {
                cout << "BUCKET " << i << " size: " << bucket_size << endl;        
            }
        }
        cout << "------------------------------" << endl;
    }
}

Main:

using namespace foo;

int main() {
    // note: bucket_count specified
    unordered_map <Position, std::string, hashFunc> test(10); 
    
    auto x = tuple{1,0,0};
    auto z = tuple{0,1,0};
    auto w = tuple{0,0,1};
            
    cout << "==================================" << endl;
    cout << "about to insert: " << x << endl;
    test[x] =  "hello";
    print_buckets(test);
    cout << "after insert of: " << x << endl;
    
    cout << "==================================" << endl;
    cout << "about to insert: " << z << endl;
    test[z] = "hey";
    print_buckets(test);
    cout << "after insert of: " << z << endl;
    
    cout << "==================================" << endl;
    cout << "about to insert: " << w << endl;
    test.insert({w, "hello"});
    print_buckets(test);
    cout << "after insert of: " << w << endl;    
    cout << "==================================" << endl;
}

Output:

==================================
about to insert: 1, 0, 0
@@@ hash function called for key: 1, 0, 0, hash: 17
------------------------------
NUM BUCKETS: 11
BUCKET 6 size: 1
------------------------------
after insert of: 1, 0, 0
==================================
about to insert: 0, 1, 0
@@@ hash function called for key: 0, 1, 0, hash: 11
@@@ hash function called for key: 1, 0, 0, hash: 17   <= why?
------------------------------
NUM BUCKETS: 11
@@@ hash function called for key: 1, 0, 0, hash: 17   <= why?
BUCKET 0 size: 1
BUCKET 6 size: 1
------------------------------
after insert of: 0, 1, 0
==================================
about to insert: 0, 0, 1
@@@ hash function called for key: 0, 0, 1, hash: 1
@@@ hash function called for key: 0, 1, 0, hash: 11   <= why?
------------------------------
NUM BUCKETS: 11
@@@ hash function called for key: 1, 0, 0, hash: 17   <= why?
BUCKET 0 size: 1
@@@ hash function called for key: 0, 1, 0, hash: 11   <= why?
BUCKET 1 size: 1
BUCKET 6 size: 1
------------------------------
after insert of: 0, 0, 1
==================================

The same behavior is observed with both gcc and clang.


Notes:

1. Trying the same without the bucket_count argument to the constructor, the calls to the hash function become even more excessive, due to rehashing (a way to avoid this is sketched after these notes). But in the scenario above there appears to be no rehash and no collisions.

2. Related, but specifically on MSVC: Inserting to std::unordered_map calls hash function twice in MSVC++'s STL, bad design or special reason?
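
(Not part of the original question, but relevant to note 1: one way to run the same experiment without the constructor's bucket count and still avoid rehash-driven hash calls is to reserve capacity up front. A sketch:)

unordered_map<Position, std::string, hashFunc> test;  // no initial bucket count
test.reserve(3);  // make room for at least 3 elements at the current max_load_factor,
                  // so the three insertions should not trigger a rehash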

Asked by Amir Kirsh on Jul 22 '20.

1 Answer

Firstly, a couple of observations:

  • The unordered map is both a hash table and a singly-linked list.

    See, for example on cppreference, that begin returns an iterator which models LegacyForwardIterator.

  • Inserting an entry into the map requires updating both the hash table and the linked list.
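
(As a quick, non-authoritative check of the first point, the following could be dropped into the question's main() after constructing test; it needs <iterator> and <type_traits>:)

static_assert(std::is_same_v<
        std::iterator_traits<decltype(test.begin())>::iterator_category,
        std::forward_iterator_tag>,
    "unordered_map iterators are forward-only: the elements are threaded on a singly-linked list");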

Secondly, a couple of notes on these containers' implementation decisions:

  • For singly-linked lists, it's common to have a sentinel node which doesn't contain any data (for something like Node<T>, it'll still have a T, just default-initialized). We only want it for its next pointer, because it helps keep list operations regular (ie, we don't have to write insert-at-the-head and insert-after-node as different special cases).

  • For hash tables (assuming linked-list buckets, since it's required by the standard) we can either use Node table[N] (so each bucket has its own sentinel preallocated) or Node* table[N].

    In this case, since we're actually using Node<T> and don't know the size of T, it seems reasonable to store a pointer for each bucket.

  • For a hash table which is also a singly-linked list, it makes sense to use the per-bucket list as (part of) the all-elements list. Otherwise we'd need to store two pointers per node, next_in_bucket and next_in_list.

    This means that the "sentinel" (one-before-the-beginning) node pointed to by a bucket is actually the last node of the previous bucket ... except for the bucket at the front of the list, for which it really is the overall list sentinel.

    The comments in the code say

      /* ...
      *  The non-empty buckets contain the node before the first node in the
      *  bucket. This design makes it possible to implement something like a
      *  std::forward_list::insert_after on container insertion and
      *  std::forward_list::erase_after on container erase
      *  calls. _M_before_begin is equivalent to
      *  std::forward_list::before_begin. Empty buckets contain
      *  nullptr.  Note that one of the non-empty buckets contains
      *  &_M_before_begin which is not a dereferenceable node so the
      *  node pointer in a bucket shall never be dereferenced, only its
      *  next node can be.
      */

    (the sentinel is _M_before_begin in this code)
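
Putting those decisions together, a rough sketch of the layout being described (purely illustrative names and members, not libstdc++'s actual ones):

struct Node {
    Node*       next;   // one forward pointer threads every element into a single list
    Position    key;    // the key itself is stored as part of the element...
    std::string mapped;
    // ...but in this configuration the hash code is NOT cached in the node
};

Node   list_sentinel;   // the overall list sentinel; only its 'next' is ever used
                        // (plays the role of _M_before_begin)
Node** table;           // table[b] points at the node *before* the first node of bucket b,
                        // or is nullptr when bucket b is empty; exactly one non-empty
                        // bucket holds &list_sentinel
size_t nbuckets;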

So, when we add an element to an already-populated bucket, the steps are roughly

void insert_to_non_empty_bucket(Node *n, Key k) {
  Node *sentinel = table[k];   // really table[bucket(hash(k))]: the node *before*
                               // the first node of k's bucket
  n->next = sentinel->next;    // splice n in as the new first node of the bucket
  sentinel->next = n;
}

Note again that we don't know or care whether the sentinel here is the last element of the previous bucket, or the overall list sentinel. The code is the same either way (which was one of the reasons for using a sentinel in the first place).

However, when we add the first element to an empty bucket (and it is not the only non-empty bucket), there is an extra step: besides recording the new bucket's own entry, we need to update the sentinel pointer for the next bucket, to point to our new node. Otherwise we'd have two buckets both pointing to the list sentinel.

void insert_to_empty_bucket(Node *n, Key k) {
  Node *sentinel = &list_sentinel; // ie, &_M_before_begin
  n->next = sentinel->next;        // n becomes the first element of the whole list
  sentinel->next = n;
  table[k] = sentinel;             // k's bucket now begins right after the list sentinel

  // update the *next* bucket in the table: the node before *its* first node is now n,
  // not the list sentinel
  table[n->next->key] = n;         // really table[bucket(hash(key of n->next))]
}

Finally: in this implementation the node doesn't cache its hash code, so there is no stored hash to read back from n->next (there is actually a trait controlling this, but it is clearly false in this case). That means the final line has to re-compute the hash of the next node's key just to work out which bucket entry to update.
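
In other words, with the illustrative names from the sketch above (and again, this is a sketch rather than the library's literal code), that last assignment effectively amounts to:

std::size_t b = hashFunc{}(n->next->key) % nbuckets;  // <-- the "extra" hash call seen in the output
table[b] = n;                                         // point the next bucket's entry at n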


NB. just to clarify, when I say previous bucket or next bucket, I'm just talking about position in the list, where buckets appear in reverse order of when they became non-empty. It doesn't have anything to do with position in the table, or imply any intrinsic ordering.
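
For example (not from the original answer), printing the iteration order after the question's three inserts should show exactly that reverse order:

for (const auto& [pos, str] : test)
    cout << pos << '\n';
// expected:
//   0, 0, 1   (bucket 1, the last bucket to become non-empty)
//   0, 1, 0   (bucket 0)
//   1, 0, 0   (bucket 6, the first bucket to become non-empty)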

Answered by Useless on Oct 13 '22.