Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Vowpal Wabbit: What hash function is used exactly?

I'd really like to know which hash function is used for feature hashing in Vowpal Wabbit.

I know that the underlying algorithm is Murmurhash 3, but I couldn't make out the details by looking at the VW code on github.

Does anyone know the exactly which hash function is used in VW?

like image 709
Kris Avatar asked Aug 20 '15 13:08

Kris


People also ask

What is Vowpal wabbit used for?

Vowpal Wabbit provides fast, efficient, and flexible online machine learning techniques for reinforcement learning, supervised learning, and more. It is influenced by an ecosystem of community contributions, academic research, and proven algorithms. Microsoft Research is a major contributor to Vowpal Wabbit.

How does the hashing trick work?

It works by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in an associative array. This trick is often attributed to Weinberger et al. (2009), but there exists a much earlier description of this method published by John Moody in 1989.

What is the purpose of using a function to hash vectors into values?

Hash functions are used in conjunction with hash tables to store and retrieve data items or data records. The hash function translates the key associated with each datum or record into a hash code, which is used to index the hash table.

Why is it called a hash function?

The term "hash" comes by way of analogy with its non-technical meaning, to "chop and mix". Indeed, typical hash functions, like the mod operation, "chop" the input domain into many sub-domains that get "mixed" into the output range to improve the uniformity of the key distribution.


1 Answers

The core hash function is the 32-bit Murmur Hash v 3.0 by Austin Appleby.

However, it is incorporated with a slight API change from the original as uniform_hash to adapt it to varying usage scenarios in vw.

You may look at the source at:

  • https://github.com/JohnLangford/vowpal_wabbit/blob/master/vowpalwabbit/hash.cc
  • https://github.com/JohnLangford/vowpal_wabbit/blob/master/vowpalwabbit/hash.h
  • https://github.com/JohnLangford/vowpal_wabbit/blob/master/vowpalwabbit/parse_primitives.cc (look for hashstrings and hashall functions)

As you can see, things are not so simple when you dig into details. The reasons it isn't simple, is that in some places additional effort was made to improve dispersion when features and name-spaces are combined in interactions and some of the reduction algorithms.

Here's a (possibly not 100% complete) list of the details:

  • After hashing there's always a modulo operation based on the current bits value (-b bits, default 18) in order to fit into the weight-vector so the value you get from the hash can be smaller than the straight/naive hash.
  • vowpal wabbit supports (SVMlight style) numerical feature names where you can use numeric "final" values directly rather than hash. By default (--hash strings), feature names that start with a digit are initially used as-is (no hash) but if they continue with some non-digit, the current value computed so far is used as a seed, and the rest of the name is murmur-32-hashed
  • When a name-space exists, the full string that gets hashed is namespace^featurename
  • When name-space interaction options (--redefine, -q, --cubic, etc.) are used the two hash results are combined with a different simpler hash.
  • In some reductions, like --search, a specific (non-zero) seed is used when feeding murmur-hash (see: vowpalwabbit/search.cc) so you may get a different hash value than what's expected.

In case of doubt, you may use the --audit option and vw will output the exact hash-value of each feature in each example. The format is (example):

#
#    UserJack1^mean_karma:3864409:0.12345:0.919323[@3.8964]
#    ^^^^+^^^^ ^^^^^+^^^^ ^^^+^^^ ^^^+^^^ ^^^^+^^^ ^^^+^^^
#        |          |        |       |        |       |
#    namespace featurename hashval value   weight Sum(gradients)
#
like image 70
arielf Avatar answered Oct 08 '22 20:10

arielf