Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hashing floating point values

Recently, I was curious how hash algorithms for floating points worked, so I looked at the source code for boost::hash_value. It turns out to be fairly complicated. The actual implementation loops over each digit in the radix and accumulates a hash value. Compared to the integer hash functions, it's much more involved.

My question is: why should a floating-point hash algorithm be any more complicated? Why not just hash the binary representation of the floating point value as if it was an integer?

Like:

std::size_t hash_value(float f)
{
  return hash_value(*(reinterpret_cast<int*>(&f)));
}

I realize that float is not guaranteed to be the same size as int on all systems, but that sort of thing could be handled with a few template meta-programs to deduce an integral type that is the same size as float. So what is the advantage of introducing an entirely different hash function that specifically operates on floating point types?

like image 898
Channel72 Avatar asked Sep 13 '11 14:09

Channel72


People also ask

How is hashing calculated?

Hashing is simply passing some data through a formula that produces a result, called a hash. That hash is usually a string of characters and the hashes generated by a formula are always the same length, regardless of how much data you feed into it. For example, the MD5 formula always produces 32 character-long hashes.

How do I find my floating point number?

The typeof operator is used to check the data type of the passed value. The isNaN() method checks if the passed value is a number. The Number. isInteger() method is used to check if the number is an integer value.

What are the properties of good hashing?

Characteristics of a Good Hash Function. There are four main characteristics of a good hash function: 1) The hash value is fully determined by the data being hashed. 2) The hash function uses all the input data. 3) The hash function "uniformly" distributes the data across the entire set of possible hash values.

What is a floating-point variable?

A floating point type variable is a variable that can hold a real number, such as 4320.0, -3.33, or 0.01226. The floating part of the name floating point refers to the fact that the decimal point can “float”; that is, it can support a variable number of digits before and after the decimal point.


1 Answers

Take a look at https://svn.boost.org/trac/boost/ticket/4038

In essence it boils down to two things:

  • Portability: when you take the binary representation of a float, then on some platform it could be possible that a float with a same value has multiple representations in binary. I don't know if there is actually a platform where such an issue exists, but with the complication of denormelized numbers, I'm not sure if this might actually happen.

  • the second issue is what you proposed, it might be that sizeof(float) does not equal sizeof(int).

I did not find anyone mentioning that the boost hash indeed avoids fewer collisions. Although I assume that separating the mantissa from the exponent might help, but the above link does not suggest that this was the driving design decision.

like image 100
H. Brandsmeier Avatar answered Sep 22 '22 12:09

H. Brandsmeier