
Efficient hash code for multiset in Java

I have defined a subinterface of java.util.Collection that effectively is a multiset (aka bag). It may not contain null elements, although that's not crucial to my question. The equals contract defined by the interface is as you would expect:

  • obj instanceof MyInterface
  • obj contains the same elements as this (by equals)
  • obj contains the same number of duplicates for each element
  • order of elements is disregarded

Now I want to write my hashCode method. My initial idea was:

int hashCode = 1;
for( Object o : this ) {
    hashCode += o.hashCode();
}

However, I noticed that com.google.common.collect.Multiset (from Guava) defines the hash code as follows:

int hashCode = 0;
for( Object o : elementSet() ) {
    hashCode += ((o == null) ? 0 : o.hashCode()) ^ count(o);
}

It strikes me as odd that an empty Multiset would have hash code 0, but more importantly I don't understand the benefit of ^ count(o) over simply adding up the hash codes of every duplicate. Maybe it's about not calculating the same hash code more than once, but then why not * count(o)?

My question: what would be an efficient hash code calculation? In my case the count for an element is not guaranteed to be cheap to obtain.
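For a concrete feel for the difference, here is a sketch of the two schemes side by side in plain Java (no Guava); the map-based multiset representation and the method names are made up for the example:

```java
import java.util.Map;

public class MultisetHashDemo {
    // The question's proposal: sum each element's hash once per duplicate.
    // Iterating distinct entries and multiplying by the count gives the
    // same result as iterating every element.
    static int sumHash(Map<?, Integer> counts) {
        int h = 1;
        for (Map.Entry<?, Integer> e : counts.entrySet()) {
            h += e.getKey().hashCode() * e.getValue();
        }
        return h;
    }

    // Guava-style: per distinct element, add (hash XOR count).
    static int xorCountHash(Map<?, Integer> counts) {
        int h = 0;
        for (Map.Entry<?, Integer> e : counts.entrySet()) {
            h += e.getKey().hashCode() ^ e.getValue();
        }
        return h;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("a", 2, "b", 1); // multiset {a, a, b}
        System.out.println(sumHash(counts));
        System.out.println(xorCountHash(counts));
    }
}
```

Both are order-independent, since each distinct element contributes one term to a sum; they differ only in how the count is mixed into that term.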

asked Sep 16 '11 by Rinke

3 Answers

Update

Let's say, for example, that we have an array we want to treat as a multiset.

In that case you have to process the entries as they come: you can't use count, and you can't assume the entries arrive in any known order.

The general function I'd consider is

int hashCode() {
    int x = INITIAL_VALUE;
    for (Object o : this) {
        x = f(x, o==null ? NULL_HASH : g(o.hashCode()));
    }
    return h(x);
}

Some observations:

  • As already stated in the other answers, the INITIAL_VALUE doesn't matter much.
  • I wouldn't go for NULL_HASH=0, since that would ignore null values.
  • The function g can be used in case you expect the hashes of the members to fall in a small range (which can happen if they are, e.g., single characters).
  • The function h can be used to improve the result; this is not very important, since it already happens e.g. in HashMap.hash(int).
  • The function f is the most important one; unfortunately it's quite limited, as it obviously must be both associative and commutative.
  • The function f should be bijective in each argument, otherwise you'd generate unnecessary collisions.

In no case would I recommend f(x, y) = x ^ y, since it makes two occurrences of an element cancel out. Using addition is better. Something like

f(x, y) = x + (2*A*x + 1) * y

where A is a constant, satisfies all the above conditions and may be worth it. For A=0 it degenerates to addition; using an even A is not good, as it shifts bits of x*y out. Using A=1 is fine, and the expression 2*x+1 can be computed with a single instruction on the x86 architecture. A larger odd A might work better in case the hashes of the members are badly distributed.
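As a sanity check (my own sketch, not part of the answer), one can verify that folding element hashes with this f, for A=1, gives the same result regardless of element order, which the equals contract requires:

```java
public class MixCheck {
    // f(x, y) = x + (2*A*x + 1) * y with A = 1
    static int f(int x, int y) {
        return x + (2 * x + 1) * y;
    }

    // Fold a sequence of element hashes, starting from some initial value.
    static int fold(int initial, int... hashes) {
        int x = initial;
        for (int h : hashes) {
            x = f(x, h);
        }
        return x;
    }

    public static void main(String[] args) {
        // The same multiset of hashes in two different orders: results agree.
        System.out.println(fold(1, 10, 20, 20, 30));
        System.out.println(fold(1, 20, 30, 10, 20));
    }
}
```

Algebraically, f(f(x, a), b) = x*(1+2a)*(1+2b) + a + b + 2ab, which is symmetric in a and b, so any permutation of the elements folds to the same value (overflow is harmless, since the identity holds modulo 2^32).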

In case you go for a non-trivial hashCode(), you should test that it works correctly. You should measure the performance of your program; maybe you'll find simple addition sufficient. Otherwise, I'd go for NULL_HASH=1, g=h=identity, and A=1.

My old answer

It may be for efficiency reasons. Calling count may be expensive for some implementations, though entrySet can be used instead. Even then it might be more costly; I can't tell.

I did a simple collision benchmark comparing Guava's hashCode with Rinke's proposal and my own:

enum HashCodeMethod {
    GUAVA {
        @Override
        public int hashCode(Multiset<?> multiset) {
            return multiset.hashCode();
        }
    },
    RINKE {
        @Override
        public int hashCode(Multiset<?> multiset) {
            int result = 0;
            for (final Object o : multiset.elementSet()) {
                result += (o==null ? 0 : o.hashCode()) * multiset.count(o);
            }
            return result;
        }
    },
    MAAARTIN {
        @Override
        public int hashCode(Multiset<?> multiset) {
            int result = 0;
            for (final Multiset.Entry<?> e : multiset.entrySet()) {
                result += (e.getElement()==null ? 0 : e.getElement().hashCode()) * (2*e.getCount()+123);
            }
            return result;
        }
    }
    ;
    public abstract int hashCode(Multiset<?> multiset);
}

The collision counting code went as follows:

private void countCollisions() throws Exception {
    final String letters1 = "abcdefgh";
    final String letters2 = "ABCDEFGH";
    final int total = letters1.length() * letters2.length();
    for (final HashCodeMethod hcm : HashCodeMethod.values()) {
        final Multiset<Integer> histogram = HashMultiset.create();
        for (final String s1 : Splitter.fixedLength(1).split(letters1)) {
            for (final String s2 : Splitter.fixedLength(1).split(letters2)) {
                histogram.add(hcm.hashCode(ImmutableMultiset.of(s1, s2, s2)));
            }
        }
        System.out.println("Collisions " + hcm + ": " + (total-histogram.elementSet().size()));
    }
}

and printed

Collisions GUAVA: 45
Collisions RINKE: 42
Collisions MAAARTIN: 0

So in this simple example Guava's hashCode performed really badly (45 collisions out of a possible 63). However, I don't claim my example is of much relevance for real life.

answered Nov 06 '22 by maaartinus

If count is expensive, don't use it. But do you know it's too expensive? You can always code several implementations and profile their performance on data you expect to be representative of your application. Then you'll know the answer instead of guessing.

As for why you use XOR, see 'Calculating Aggregate hashCodes with XOR'.
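XOR has a well-known tradeoff worth keeping in mind alongside that article (a minimal illustration of my own, not taken from it): when used to fold element hashes directly, duplicate pairs cancel out.

```java
public class XorCancellation {
    public static void main(String[] args) {
        int h = "x".hashCode();
        // XOR-folding element hashes makes duplicate pairs vanish:
        // the multiset {x, x, y} would collide with {y}, and
        // {x, x} would collide with the empty multiset.
        System.out.println((h ^ h) == 0);                               // true
        System.out.println((h ^ h ^ "y".hashCode()) == "y".hashCode()); // true
    }
}
```

This is presumably why Guava XORs each element's hash with its count and then sums, rather than XOR-folding across elements.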

answered Nov 06 '22 by Nick Veys

It strikes me as odd that an empty Multiset would have hash code 0

Why? Most empty collections have hash code 0 (java.util.Set does, though java.util.List uses 1). Even if not, it would have to be some fixed value, since all empty instances of your interface are equal to each other, so what is wrong with 0?

what would be an efficient hash code calculation?

Yours is more efficient (which just means faster to compute) and not that bad in terms of effectiveness (meaning it yields results that work well), either. If I understand it correctly, it adds up the hash codes of all elements, with each duplicate contributing once per occurrence. This is exactly what a regular Set does, so if you have no duplicates you get the same hashCode as with a Set, which could be an advantage (if you fix the empty multiset to have hashCode 0, not 1).

Google's version is a little more complicated, I suppose in order to avoid some otherwise frequent collisions. Of course it probably trades these for other collisions that are considered less frequent.

In particular, using XOR spreads the hashCodes over the whole available range even if the individual input hashCodes do not (which they don't, for example, for Integers from a limited range, a frequent use case).

Consider the hashCode for the Set [1, 2, 3]: it is 6, and likely to collide with similar Sets such as [6], [4, 2], and [5, 1]. Throwing some XOR in there helps. Whether it is necessary and worth the extra cost is a tradeoff you have to make.
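These collisions are easy to reproduce, since Set.hashCode() is specified as the sum of the element hash codes and Integer.hashCode() is the value itself (a quick check, assuming Java 9+ for Set.of):

```java
import java.util.Set;

public class SetHashCollisions {
    public static void main(String[] args) {
        // Four distinct sets, one shared hash code: 1+2+3 = 6 = 4+2 = 5+1.
        System.out.println(Set.of(1, 2, 3).hashCode()); // 6
        System.out.println(Set.of(6).hashCode());       // 6
        System.out.println(Set.of(4, 2).hashCode());    // 6
        System.out.println(Set.of(5, 1).hashCode());    // 6
    }
}
```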

answered Nov 06 '22 by Thilo