Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Ruby: Why does `#hash` need to overridden whenever `#eql?` is overridden?

Tags:

ruby

In this presentation the speaker has created a value class.

In implementing it, he overrides #eql? and says that in Java development, the idiom is that whenever you override #eql? you must override #hash.

class Weight
  # ...

  def hash
    pounds.hash
  end

  def eql?(other)
    self.class == other.class &&
      self.pounds == other.pounds
  end
  alias :== eql?
end

Firstly, what is the #hash method? I can see it returns an integer.

> 1.hash
=> -3708808305943022538

> 2.hash
=> 1196896681607723080

> 1.hash
=> -3708808305943022538

Using pry I can see that an integer responds to #hash but I cannot see where it inherits the method from. It's not defined on Numeric or Object. If I knew what this method did, I would probably understand why it needs to be overridden at the same time as #eql?.

So, why does #hash need to be overridden whenever eql? is overridden?

like image 901
user3574603 Avatar asked Mar 02 '19 17:03

user3574603


1 Answers

Firstly, what is the #hash method? I can see it returns an integer.

The #hash method is supposed to return a hash of the receiver. (The name of the method is a bit of a giveaway).

Using pry I can see that an integer responds to #hash but I cannot see where it inherits the method from.

There are dozens of questions of the type "Where does this method come from" on [so], and the answer is always the same: the best way to know where a method comes from, is to simply ask it:

hash_method = 1.method(:hash)
hash_method.owner #=> Kernel

So, #hash is inherited from Kernel. Note however, that there is a bit of a peculiar relationship between Object and Kernel, in that some methods that are implemented in Kernel are documented in Object or vice versa. This probably has historic reasons, and is now an unfortunate fact of life in the Ruby community.

Unfortunately, for reasons I don't understand, the documentation for Object#hash was deleted in 2017 in a commit ironically titled "Add documents". It is, however, still available in Ruby 2.4 (bold emphasis mine):

hashinteger

Generates an Integer hash value for this object. This function must have the property that a.eql?(b) implies a.hash == b.hash.

The hash value is used along with eql? by the Hash class to determine if two objects reference the same hash key. […]

So, as you can see, there is a deep and important relationship between #eql? and #hash, and in fact the correct behavior of methods that use #eql? and #hash depends on the fact that this relationship is maintained.

So, we know that the method is called #hash and thus likely computes a hash. We know it is used together with eql?, and we know that it is used in particular by the Hash class.

What does it do, exactly? Well, we all know what a hash function is: it is a function that maps a larger, potentially infinite, input space into a smaller, finite, output space. In particular, in this case, the input space is the space of all Ruby objects, and the output space is the "fast integers" (i.e. the ones that used to be called Fixnum).

And we know how a hash table works: values are placed in buckets based on the hash of their keys, if I want to find a value, then I only need to compute the hash of the key (which is fast) and know which bucket I find the value in (in constant time), as opposed to e.g. an array of key-value-pairs, where I need to compare the key against every key in the array (linear search) to find the value.

However, there is a problem: Since the output space of a hash is smaller than the input space, there are different objects which have the same hash value and thus end up in the same bucket. Thus, when two objects have different hash values, I know for a fact that they are different, but if they have the same hash value, then they could still be different, and I need to compare them for equality to be sure – and that's where the relationship between hash and equality comes from. Also note that when many keys and up in the same bucket, I will again have to compare the search key against every key in the bucket (linear search) to find the value.

From all this we can conclude the following properties of the #hash method:

  • It must return an Integer.
  • Not only that, it must return a "fast integer" (equivalent to the old Fixnums).
  • It must return the same integer for two objects that are considered equal.
  • It may return the same integer for two objects that are considered unequal.
  • However, it only should do so with low probability. (Otherwise, a Hash may degenerate into a linked list with highly degraded performance.)
  • It also should be hard to construct objects that are unequal but have the same hash value deliberately. (Otherwise, an attacker can force a Hash to degenerate into a linked list as a form of Degradation-of-Service attack.)
like image 159
Jörg W Mittag Avatar answered Oct 17 '22 08:10

Jörg W Mittag