
Order-independent Hash Algorithm

I am currently working on a collection library for my custom programming language. I already have several data types (Collection, List, Map, Set) and implementations for them (mutable and immutable), but what I have been missing so far is hashCode and equals. While these are no problem for Lists, which are ordered collections, they play a special role for Sets and Maps. Two Sets are considered equal if they have the same size and the same elements, and the order in which the Sets maintain them should make no difference to their equality. Because of the equals-hashCode contract, the hashCode implementation also has to reflect this behavior, meaning that two Sets with the same elements but different ordering must have the same hash code. (The same applies to Maps, which are technically Sets of key-value pairs.)

Example (Pseudocode):

let set1: Set<String> = [ "a", "b", "c" ]
let set2: Set<String> = [ "b", "c", "a" ]
set1 == set2       // should return true
set1.hashCode == set2.hashCode // should also return true

How would I implement a reasonably good hash algorithm for which the hashCodes in the above example return the same value?

Asked by Clashsoft on Jun 09 '15 at 14:06



1 Answer

The JDK itself proposes the following solution to this problem. The contract of the java.util.Set interface states:

Returns the hash code value for this set. The hash code of a set is defined to be the sum of the hash codes of the elements in the set, where the hash code of a null element is defined to be zero. This ensures that s1.equals(s2) implies that s1.hashCode()==s2.hashCode() for any two sets s1 and s2, as required by the general contract of Object.hashCode().
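To make the quoted contract concrete, here is a minimal Java sketch of a sum-based set hash. The class name SumHashDemo and the helper sumHash are made up for this illustration; only LinkedHashSet, Arrays and hashCode come from the JDK. The two sets are built in different insertion orders, mirroring the example in the question.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class SumHashDemo {
    // Hypothetical helper: sums the element hash codes, counting null as 0,
    // as described in the java.util.Set contract.
    static int sumHash(Iterable<?> elements) {
        int h = 0;
        for (Object e : elements) {
            h += (e == null) ? 0 : e.hashCode(); // int overflow simply wraps around
        }
        return h;
    }

    public static void main(String[] args) {
        // Same elements, different insertion (and iteration) order.
        Set<String> set1 = new LinkedHashSet<>(Arrays.asList("a", "b", "c"));
        Set<String> set2 = new LinkedHashSet<>(Arrays.asList("b", "c", "a"));

        System.out.println(set1.equals(set2));                // true
        System.out.println(sumHash(set1) == sumHash(set2));   // true
        System.out.println(sumHash(set1) == set1.hashCode()); // true
    }
}
```

Because integer addition is commutative and associative (even with overflow), the iteration order cannot influence the result.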

An alternative to using the sum of the entries' hash codes would be to use, for example, the ^ (XOR) operator.
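A minimal sketch of the XOR variant in Java, assuming the same null-is-zero convention as the Set contract. The names XorHashDemo and xorHash are invented for this example.

```java
import java.util.Arrays;
import java.util.List;

public class XorHashDemo {
    // Hypothetical helper: XORs the element hash codes together, null counts as 0.
    static int xorHash(Iterable<?> elements) {
        int h = 0;
        for (Object e : elements) {
            h ^= (e == null) ? 0 : e.hashCode();
        }
        return h;
    }

    public static void main(String[] args) {
        // Lists are used here only to force two different iteration orders.
        List<String> order1 = Arrays.asList("a", "b", "c");
        List<String> order2 = Arrays.asList("b", "c", "a");

        System.out.println(xorHash(order1) == xorHash(order2)); // true
    }
}
```

One thing to keep in mind with XOR is that two elements with equal hash codes cancel each other out (x ^ x == 0), so whether sum or XOR distributes better depends on the element hash codes you expect.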

The Scala language uses an order-invariant variant of the MurmurHash algorithm (cf. the private scala.util.hashing.MurmurHash3 class) to implement the hashCode (or ##) method of its immutable sets and similar collections.
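The general idea behind such an order-invariant mixing hash can be sketched in Java as follows. This is an illustration only, not a port of Scala's code: it accumulates several commutative combinations of the element hashes (sum, XOR, product) and then scrambles those accumulators with MurmurHash3-style mixing rounds. The class name, method names and seed parameter are assumptions made for this example; the constants are the ones from the public 32-bit MurmurHash3 algorithm.

```java
public class UnorderedHashDemo {
    // One MurmurHash3-style (32-bit) mixing round.
    static int mix(int h, int data) {
        int k = data * 0xcc9e2d51;
        k = Integer.rotateLeft(k, 15);
        k *= 0x1b873593;
        h ^= k;
        h = Integer.rotateLeft(h, 13);
        return h * 5 + 0xe6546b64;
    }

    // MurmurHash3-style finalizer: folds in the element count and avalanches the bits.
    static int finalizeHash(int h, int count) {
        h ^= count;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Hypothetical order-independent hash: every accumulator is commutative,
    // so only the final mixing steps are order-sensitive, and they run once.
    static int unorderedHash(Iterable<?> elements, int seed) {
        int sum = 0, xor = 0, product = 1, count = 0;
        for (Object e : elements) {
            int h = (e == null) ? 0 : e.hashCode();
            sum += h;
            xor ^= h;
            product *= (h | 1); // force odd so the product can never collapse to 0
            count++;
        }
        int h = seed;
        h = mix(h, sum);
        h = mix(h, xor);
        h = mix(h, product);
        return finalizeHash(h, count);
    }

    public static void main(String[] args) {
        int h1 = unorderedHash(java.util.Arrays.asList("a", "b", "c"), 0);
        int h2 = unorderedHash(java.util.Arrays.asList("b", "c", "a"), 0);
        System.out.println(h1 == h2); // true
    }
}
```

Compared with a plain sum or XOR, two different element sets have to agree on all three accumulators and on the count to collide before mixing, at the cost of a little more work per element.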

Answered by Dirk on Sep 28 '22 at 03:09