I'd really like to know which hash function is used for feature hashing in Vowpal Wabbit.
I know that the underlying algorithm is Murmurhash 3, but I couldn't make out the details by looking at the VW code on github.
Does anyone know the exactly which hash function is used in VW?
Vowpal Wabbit provides fast, efficient, and flexible online machine learning techniques for reinforcement learning, supervised learning, and more. It is influenced by an ecosystem of community contributions, academic research, and proven algorithms. Microsoft Research is a major contributor to Vowpal Wabbit.
It works by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in an associative array. This trick is often attributed to Weinberger et al. (2009), but there exists a much earlier description of this method published by John Moody in 1989.
Hash functions are used in conjunction with hash tables to store and retrieve data items or data records. The hash function translates the key associated with each datum or record into a hash code, which is used to index the hash table.
The term "hash" comes by way of analogy with its non-technical meaning, to "chop and mix". Indeed, typical hash functions, like the mod operation, "chop" the input domain into many sub-domains that get "mixed" into the output range to improve the uniformity of the key distribution.
The core hash function is the 32-bit Murmur Hash v 3.0 by Austin Appleby.
However, it is incorporated with a slight API change from the original as uniform_hash
to adapt it to varying usage scenarios in vw
.
You may look at the source at:
hashstrings
and hashall
functions)As you can see, things are not so simple when you dig into details. The reasons it isn't simple, is that in some places additional effort was made to improve dispersion when features and name-spaces are combined in interactions and some of the reduction algorithms.
Here's a (possibly not 100% complete) list of the details:
-b bits
, default 18) in order to fit into the weight-vector so the value you get from the hash can be smaller than the straight/naive hash.--hash strings
), feature names that start with a digit are initially used as-is (no hash) but if they continue with some non-digit, the current value computed so far is used as a seed, and the rest of the name is murmur-32-hashed--redefine
, -q
, --cubic
, etc.) are used the two hash results are combined with a different simpler hash.--search
, a specific (non-zero) seed is used when feeding murmur-hash (see: vowpalwabbit/search.cc
) so you may get a different hash value than what's expected.In case of doubt, you may use the --audit
option and vw
will output the exact hash-value of each feature in each example. The format is (example):
#
# UserJack1^mean_karma:3864409:0.12345:0.919323[@3.8964]
# ^^^^+^^^^ ^^^^^+^^^^ ^^^+^^^ ^^^+^^^ ^^^^+^^^ ^^^+^^^
# | | | | | |
# namespace featurename hashval value weight Sum(gradients)
#
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With