Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Tabulation hashing and N3980

I am having trouble adapting the pending C++1z proposal N3980 by @HowardHinnant to work with tabulation hashing.

Ab initio computing a tabulation hash works the same as for the hashing algorithm (Spooky, Murmur, etc.) described in N3980. It is not that complicated: just serialize an object of any user-defined type through hash_append() and let the hash function update a pointer into a table of random numbers as you go along.

The trouble starts when trying to implement one of the nice properties of tabulation hashing: very cheap to compute incremental updates to the hash if an object is mutated. For "hand-made" tabulation hashes, one just recomputes the hash of the object's affected bytes.

My question is: how to communicate incremental updates to a uhash<MyTabulationAlgorithm> function object while keeping true to the central theme of N3980 (Types don't know #)?

To illustrate the design difficulties: say I have a user-defined type X with N data members xi of various types Ti

struct X 
{
   T1 x1;
   ...
   TN xN;
};

Now create an object and compute its hash

X x { ... }; // initialize
std::size_t h = uhash<MyTabulationAlgorithm>(x);

Update a single member, and recompute the hash

x.x2 = 42;
h ^= ...; // ?? I want to avoid calling uhash<>() again

I could compute the incremental update as something like

h ^= hash_update(x.x2, start, stop); 

where [start, stop) represents the range of the table of random numbers that correspond to the x2 data member. However, in order to incrementally (i.e. cheaply!) update the hash for arbitrary mutations, every data member needs to somehow know its own subrange in the serialized byte stream of its containing class. This doesn't feel like the spirit of N3980. E.g., adding new data members to the containing class, would change the class layout and therefore the offsets in the serialized byte stream.

Application: tabulation hashing is very old, and it has recently been shown that it has very nice mathematical properties (see the Wikipedia link). It's also very popular in the board game programming community (computer chess and go e.g.) where it goes under the name of Zobrist hashing. There, a board position plays the role of X, and a move the role of a small update (move a piece from its source to its destination e.g.). It would be nice if N3980 could not only be adapted to such tabulation hashing, but that it can also accomodate the cheap incremental updates.

like image 820
TemplateRex Avatar asked Dec 17 '14 13:12

TemplateRex


1 Answers

It seems that you should be able to do this by telling MyTabulationAlgorithm to ignore the values of all class members except that which has changed:

x.x2 = 42;
IncrementalHashAdaptor<MyTabulationAlgorithm, T2> inc{x.x2};
hash_append(inc, x);
h ^= inc;

All IncrementalHashAdaptor has to do is check the memory range it is passed to see whether x2 is included in it:

template<class HashAlgorithm, class T>
struct IncrementalHashAdaptor
{
    T& t;
    HashAlgorithm h = {};
    bool found = false;
    void operator()(void const* key, std::size_t len) noexcept
    {
        if (/* t contained within [key, key + len) */) {
            assert(!found);
            found = true;
            char const* p = addressof(t);
            h.ignore(key, (p - key));
            h(p, sizeof(T));
            h.ignore(p + sizeof(T), len - (p - key) - sizeof(T));
        }
        else {
            h.ignore(key, len);
        }
    }
    operator std:size_t() const { assert(found); return h; }
};

Obviously, this will only work for members whose object location is both possible to determine externally and corresponds to the memory block passed to the hash algorithm; but this should correspond to the vast majority of cases.

You would probably want to wrap IncrementalHashAdaptor and the following hash_append into a uhash_incremental utility; this is left as an exercise for the reader.

There is a question mark over performance; assuming HashAlgorithm::ignore(...) is visible to the compiler and is uncomplicated it should optimize well; if this does not occur you should be able to calculate the byte-stream address of X::x2 at program startup using a similar strategy.

like image 77
ecatmur Avatar answered Oct 31 '22 12:10

ecatmur