Should I dump java.util.HashSet in favor of CompactHashSet? [closed]

I found that there is a Set implementation based on hashing (with all the useful consequences, like O(1) contains(), etc.) that is claimed to be more efficient than java.util.HashSet in every respect:

http://ontopia.wordpress.com/2009/09/23/a-faster-and-more-compact-set/

http://alias-i.com/lingpipe/docs/api/com/aliasi/util/CompactHashSet.html

Would it then be a good idea to stop using java.util.HashSet entirely, wherever I need a java.util.Set, in favor of com.aliasi.util.CompactHashSet?
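
To illustrate what such a wholesale switch could look like: since the linked Javadoc describes CompactHashSet as a java.util.Set implementation, only the construction sites need to name the concrete class. A hypothetical factory (the Sets class, the newSet method, and the no-argument constructor are my assumptions, not taken from either library):

import java.util.Set;
import com.aliasi.util.CompactHashSet;

// Hypothetical helper: the single place that names the concrete implementation,
// so reverting to java.util.HashSet is a one-line change.
public final class Sets {
    private Sets() {}

    public static <E> Set<E> newSet() {
        return new CompactHashSet<E>();  // assumed no-arg constructor; check the Javadoc
    }
}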

asked Oct 14 '14 by gvlasov


3 Answers

Let's start a little benchmark game.

The benchmarks below are based on those from the original article, but use modern tools: JMH for the timings and JOL (Java Object Layout) for the memory footprints.

package tests;

import com.carrotsearch.hppc.ObjectOpenHashSet;
import com.carrotsearch.hppc.cursors.ObjectCursor;
import com.google.common.collect.GuavaCompactHashSet;
import net.ontopia.utils.CompactHashSet;
import net.openhft.koloboke.collect.set.hash.HashObjSet;
import net.openhft.koloboke.collect.set.hash.HashObjSets;
import org.openjdk.jmh.annotations.*;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

import static java.util.Arrays.stream;
import static org.openjdk.jol.info.GraphLayout.parseInstance;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@OperationsPerInvocation(TestHashSet.TIMES)
@Threads(1)
@Fork(1)
@State(Scope.Thread)
public class TestHashSet {
    public static final int TIMES = 1000000;  // operations per benchmark invocation
    private static final int MAX = 5000000;   // key range, so most of the 1M random keys are distinct
    private static long ELEMENTS_SIZE;        // retained size of the distinct elements, measured with JOL

    static Long[] add = new Long[TIMES], lookup = new Long[TIMES], remove = new Long[TIMES];
    static {
        for (int ix = 0; ix < TIMES; ix++)
            add[ix] = new Long(Math.round(Math.random() * MAX));
        // Subtracted later, so the reported footprint covers only the set's own overhead.
        ELEMENTS_SIZE = stream(add).distinct().count() * parseInstance(add[0]).totalSize();
        for (int ix = 0; ix < TIMES; ix++)
            lookup[ix] = new Long(Math.round(Math.random() * MAX));
        for (int ix = 0; ix < TIMES; ix++)
            remove[ix] = new Long(Math.round(Math.random() * MAX));
    }

    // Every benchmark does the same work: bulk add, membership lookups, full
    // iteration, then bulk remove. The result is returned so the JIT cannot
    // eliminate the loops as dead code.
    @Benchmark
    public int hashSet() {
        Set<Long> set = new HashSet<Long>();
        for (Long o : add) {
            set.add(o);
        }
        int r = 0;
        for (Long o : lookup) {
            r ^= set.contains(o) ? 1 : 0;
        }
        for (Long o : set) {
            r += o.intValue();
        }
        for (Long o : remove) {
            set.remove(o);
        }
        return r + set.size();
    }

    @Benchmark
    public int compactHashSet() {
        Set<Long> set = new CompactHashSet<Long>();
        for (Long o : add) {
            set.add(o);
        }
        int r = 0;
        for (Long o : lookup) {
            r ^= set.contains(o) ? 1 : 0;
        }
        for (Long o : set) {
            r += o.intValue();
        }
        for (Long o : remove) {
            set.remove(o);
        }
        return r + set.size();
    }

    @Benchmark
    public int hppcSet() {
        ObjectOpenHashSet<Long> set = new ObjectOpenHashSet<Long>();
        for (Long o : add) {
            set.add(o);
        }
        int r = 0;
        for (Long o : lookup) {
            r ^= set.contains(o) ? 1 : 0;
        }
        for (ObjectCursor<Long> cur : set) {
            r += cur.value.intValue();
        }
        for (Long o : remove) {
            set.remove(o);
        }
        return r + set.size();
    }

    @Benchmark
    public int kolobokeSet() {
        Set<Long> set = HashObjSets.newMutableSet();
        for (Long o : add) {
            set.add(o);
        }
        int r = 0;
        for (Long o : lookup) {
            r ^= set.contains(o) ? 1 : 0;
        }
        for (Long o : set) {
            r += o.intValue();
        }
        for (Long o : remove) {
            set.remove(o);
        }
        return r + set.size();
    }

    @Benchmark
    public int guavaCompactHashSet() {
        // fair comparison -- growing table
        Set<Long> set = new GuavaCompactHashSet<>(10);
        for (Long o : add) {
            set.add(o);
        }
        int r = 0;
        for (Long o : lookup) {
            r ^= set.contains(o) ? 1 : 0;
        }
        for (Long o : set) {
            r += o.intValue();
        }
        for (Long o : remove) {
            set.remove(o);
        }
        return r + set.size();
    }

    public static void main(String[] argv) {
        // Not a JMH benchmark: prints the per-element memory footprint of each set.
        HashSet<Long> hashSet = new HashSet<>();
        test("HashSet", hashSet, hashSet::add);
        CompactHashSet<Long> compactHashSet = new CompactHashSet<>();
        test("CompactHashSet", compactHashSet, compactHashSet::add);
        HashObjSet<Long> kolobokeSet = HashObjSets.newMutableSet();
        test("KolobokeSet", kolobokeSet, kolobokeSet::add);
        ObjectOpenHashSet<Long> hppcSet = new ObjectOpenHashSet<>();
        test("HPPC set", hppcSet, hppcSet::add);
        GuavaCompactHashSet<Long> guavaCompactHashSet = new GuavaCompactHashSet<>(10);
        test("GuavaCompactHashSet", guavaCompactHashSet, guavaCompactHashSet::add);
    }

    public static void test(String name, Object set, Consumer<Long> setAdd) {
        for (Long o : add) {
            setAdd.accept(o);
        }
        // JOL measures the whole object graph; subtract the elements' own size.
        System.out.printf("%s: %.1f bytes per element\n", name,
                ((parseInstance(set).totalSize() - ELEMENTS_SIZE) * 1.0 / TIMES));
    }
}
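
To reproduce the timings, the benchmark class can be launched from a plain JMH Runner. A minimal sketch (it assumes the JMH dependency is on the classpath; the iteration counts are illustrative):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class TestHashSetRunner {
    public static void main(String[] args) throws RunnerException {
        Options opts = new OptionsBuilder()
                .include(TestHashSet.class.getSimpleName())
                .warmupIterations(5)       // let the JIT warm up before measuring
                .measurementIterations(5)
                .build();
        new Runner(opts).run();
    }
}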

Results:

Set implementation    Speed (ns/op)   Memory footprint (bytes/elem)
                                      +UseCompressedOops  -UseCompressedOops
CompactHashSet        828             8.4                 16.8
HashSet               676             37.4                60.3
HPPC Set              853             10.5                18.9
Koloboke Set          587             8.4                 16.8
GuavaCompactHashSet   874             25.9                37.4

It appears that CompactHashSet is actually slower than good old HashSet, despite using much less memory.

answered by leventov


It depends.

Are you dealing with very large sets and many insert or read operations? The new implementation cut the time roughly in half for a million operations. That's a great improvement, but if you're only doing a few thousand operations, or a dozen, it quickly turns into a micro-optimization.

The tests shown also insert Long values into the set. Both the runtime and the memory footprint may change if you store something else, as sketched below.
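
If you want to check your own payload, JMH's @Param makes it easy to rerun the same measurement per element type. A hypothetical sketch (the ElementTypeData class and its fields are mine, reusing the TIMES and MAX constants from the benchmark in the other answer):

import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class ElementTypeData {
    static final int TIMES = 1000000;
    static final int MAX = 5000000;

    @Param({"Long", "String"})  // JMH repeats the benchmark once per value
    String elementType;

    Object[] add;

    @Setup
    public void setup() {
        add = new Object[TIMES];
        for (int i = 0; i < TIMES; i++) {
            long v = Math.round(Math.random() * MAX);
            // Same keys, different payload class: Long boxing vs. String hashing.
            add[i] = elementType.equals("Long") ? (Object) v : Long.toString(v);
        }
    }
}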

If you have a use case that provably benefits from the change in a statistically significant way, then use it.

answered by Freiheit


Option 1: Don't care. If you look at the java.util.HashSet implementation, you will discover that it simply uses a HashMap internally:

public class HashSet<E>
    extends AbstractSet<E>
    implements Set<E>, Cloneable, java.io.Serializable
{
   static final long serialVersionUID = -5024744406713321676L;

   private transient HashMap<E,Object> map;
....

That is a fast implementation; however, each set entry holds a reference to a value that is simply not needed, hence the extra memory consumption. My first option is "don't care", since I hope that at some point in the future someone will provide an improved HashSet in the JDK. Software engineers should always have hope and a positive attitude :)
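
For reference, this is how that unused value looks in the OpenJDK source: every element is mapped to one shared dummy object, yet each hash-table entry still reserves a reference slot for it:

// From java.util.HashSet (OpenJDK)
private static final Object PRESENT = new Object();

public boolean add(E e) {
    return map.put(e, PRESENT) == null;
}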

Within normal program logic I always stick to the provided standard library as much as possible and use what is available. This avoids the situation where each programmer uses their own "favorite Set implementation" or, even worse, does lengthy research into which HashSet implementation is actually the best to use ;)

Does Oracle have an open bug ticket for the poor HashMap? I cannot find one....

Option 2: Care. If you are not writing business logic but some technical middleware code, then performance may matter, and there are various options. The CompactHashMap within Google Guava is one. Another nice library is the High Performance Primitive Collections (HPPC), where you will also find sets for every primitive type; a sketch follows below. You will probably find other things that fit your particular purpose, too. Be aware that not every HashMap replacement has exactly the same semantics as the original HashMap.
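
For example, a primitive-specialized set from HPPC avoids boxing entirely. A minimal sketch, assuming the same HPPC version as the benchmark above, where the long-specialized set is named LongOpenHashSet (newer HPPC releases renamed it LongHashSet):

import com.carrotsearch.hppc.LongOpenHashSet;

public class PrimitiveSetDemo {
    public static void main(String[] args) {
        // Stores primitive longs directly: no Long boxing and no
        // per-element object header, which also improves cache locality.
        LongOpenHashSet ids = new LongOpenHashSet();
        ids.add(42L);
        System.out.println(ids.contains(42L));  // true
        System.out.println(ids.contains(7L));   // false
    }
}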

So, I personally would never replace java.util.HashMap just "by default".

answered by cruftex