 

Why does LLVM choose open-addressing hash table to implement llvm::StringMap?

Many sources say open-addressing, the hash collision handling approach used in llvm::StringMap, is not stable. Open-addressing is said to be inferior to chaining when the load factor is high (which is imaginable).

But if the load factor is low, there will be a huge memory waste for open-addressing, because I have to allocate Bucket_number * sizeof(Record) bytes in memory, even if the majority of buckets do not hold a record.

So my question is, what is the reason for LLVM to choose open-addressing over separate-chaining? Is it merely because of the advantage in speed gained by cache locality (records are stored in buckets themselves)?

Thanks:)

Edit: the C++11 standard's requirements on std::unordered_set and std::unordered_map imply the chaining approach, not open-addressing. Why does LLVM choose a hash collision handling method that can't even satisfy the C++ standard? Are there any special use cases for llvm::StringMap that warrant this deviation? (Edit: this slide deck compares the performance of several LLVM data structures with that of their STL counterparts.)

Another question, by the way:

How does llvm::StringMap guarantee that keys' hash values are not recomputed when growing? The manual says:

hash table growth does not recompute the hash values for strings already in the table.

Leedehai asked Jul 29 '17 09:07

1 Answer

Let us look at the implementation. Here we see that the table is stored as parallel arrays: an array of indirect pointers to records alongside an array of cached 32-bit hash codes, i.e. a structure-of-arrays layout.

Effectively:

struct StringMap {
    uint32_t hashcode[CAPACITY];
    StringMapEntry *hashptr[CAPACITY];
};

Except that the capacity is dynamic and the load factor would appear to be maintained at between 37.5% and 75% of capacity.

For N records at a load factor F this yields N/F pointers plus N/F 32-bit integers for the open-addressed implementation, as compared to N*(1+1/F) pointers plus N integers for the equivalent chained implementation. On a typical 64-bit system the open-addressed version is between ~4% larger and ~30% smaller.

However, as you rightly suspected, the main advantage here lies in cache effects. Beyond reducing cache pressure by shrinking the data, collisions are filtered by a linear reprobe over consecutive 32-bit hash codes, without examining any further information. Rejecting a collision is therefore much faster than in the chained case, where the link must be followed into what is likely uncached storage; consequently a significantly higher load factor may be used. On the flip side, one additional likely cache miss must be taken on the pointer lookup table, but this is a constant that does not degrade with load, equivalent to one chained collision.

Effectively:

StringMapEntry *StringMap::lookup(const char *text) {
    uint32_t hash_value = hash_function(text);  // computed once, outside the loop
    for (uint32_t *scan = &hashcode[hash_value % CAPACITY]; *scan != SENTINEL; ++scan) {
        if (hash_value == *scan) {
            StringMapEntry *entry = hashptr[scan - hashcode];
            if (!std::strcmp(entry->text, text))
                return entry;
        }
    }
    return nullptr;
}

Plus subtleties such as wrapping.

As for your second question, the optimization is to precalculate and store the hash codes. This wastes some storage but avoids the expensive operation of examining potentially long, variable-length strings unless a match is almost certain. In degenerate cases, such as the mangled names of complex templates, keys may be hundreds of characters long.

A further optimization in RehashTable is to use a power-of-two table size instead of a prime. This ensures that growing the table effectively brings one additional hash code bit into play and de-interleaves the doubled table into two consecutive target arrays, effectively rendering the operation a cache-friendly linear sweep.

doynax answered Oct 23 '22 00:10