I found a Set implementation that uses hashing (with all the useful consequences, such as O(1) contains()) and is claimed to be more efficient than java.util.HashSet in every respect:
http://ontopia.wordpress.com/2009/09/23/a-faster-and-more-compact-set/
http://alias-i.com/lingpipe/docs/api/com/aliasi/util/CompactHashSet.html
Would it then be a good idea to stop using java.util.HashSet entirely, wherever I need a java.util.Set, in favor of com.aliasi.util.CompactHashSet?
Let's start a little benchmark game. The benchmarks below are based on those from the original article, but use modern tools: JMH for timing and JOL for measuring memory footprint.
package tests;

import com.carrotsearch.hppc.ObjectOpenHashSet;
import com.carrotsearch.hppc.cursors.ObjectCursor;
import com.google.common.collect.GuavaCompactHashSet;
import net.ontopia.utils.CompactHashSet;
import net.openhft.koloboke.collect.set.hash.HashObjSet;
import net.openhft.koloboke.collect.set.hash.HashObjSets;
import org.openjdk.jmh.annotations.*;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

import static java.util.Arrays.stream;
import static org.openjdk.jol.info.GraphLayout.parseInstance;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@OperationsPerInvocation(TestHashSet.TIMES)
@Threads(1)
@Fork(1)
@State(Scope.Thread)
public class TestHashSet {
    public static final int TIMES = 1000000;
    private static final int MAX = 5000000;
    private static long ELEMENTS_SIZE;

    static Long[] add = new Long[TIMES], lookup = new Long[TIMES], remove = new Long[TIMES];

    static {
        for (int ix = 0; ix < TIMES; ix++)
            add[ix] = new Long(Math.round(Math.random() * MAX));
        ELEMENTS_SIZE = stream(add).distinct().count() * parseInstance(add[0]).totalSize();
        for (int ix = 0; ix < TIMES; ix++)
            lookup[ix] = new Long(Math.round(Math.random() * MAX));
        for (int ix = 0; ix < TIMES; ix++)
            remove[ix] = new Long(Math.round(Math.random() * MAX));
    }

    @Benchmark
    public int hashSet() {
        Set<Long> set = new HashSet<Long>();
        for (Long o : add) {
            set.add(o);
        }
        int r = 0;
        for (Long o : lookup) {
            r ^= set.contains(o) ? 1 : 0;
        }
        for (Long o : set) {
            r += o.intValue();
        }
        for (Long o : remove) {
            set.remove(o);
        }
        return r + set.size();
    }

    @Benchmark
    public int compactHashSet() {
        Set<Long> set = new CompactHashSet<Long>();
        for (Long o : add) {
            set.add(o);
        }
        int r = 0;
        for (Long o : lookup) {
            r ^= set.contains(o) ? 1 : 0;
        }
        for (Long o : set) {
            r += o.intValue();
        }
        for (Long o : remove) {
            set.remove(o);
        }
        return r + set.size();
    }

    @Benchmark
    public int hppcSet() {
        ObjectOpenHashSet<Long> set = new ObjectOpenHashSet<Long>();
        for (Long o : add) {
            set.add(o);
        }
        int r = 0;
        for (Long o : lookup) {
            r ^= set.contains(o) ? 1 : 0;
        }
        for (ObjectCursor<Long> cur : set) {
            r += cur.value.intValue();
        }
        for (Long o : remove) {
            set.remove(o);
        }
        return r + set.size();
    }

    @Benchmark
    public int kolobokeSet() {
        Set<Long> set = HashObjSets.newMutableSet();
        for (Long o : add) {
            set.add(o);
        }
        int r = 0;
        for (Long o : lookup) {
            r ^= set.contains(o) ? 1 : 0;
        }
        for (Long o : set) {
            r += o.intValue();
        }
        for (Long o : remove) {
            set.remove(o);
        }
        return r + set.size();
    }

    @Benchmark
    public int guavaCompactHashSet() {
        // fair comparison -- growing table
        Set<Long> set = new GuavaCompactHashSet<>(10);
        for (Long o : add) {
            set.add(o);
        }
        int r = 0;
        for (Long o : lookup) {
            r ^= set.contains(o) ? 1 : 0;
        }
        for (Long o : set) {
            r += o.intValue();
        }
        for (Long o : remove) {
            set.remove(o);
        }
        return r + set.size();
    }

    public static void main(String[] argv) {
        HashSet hashSet = new HashSet();
        test("HashSet", hashSet, hashSet::add);
        CompactHashSet compactHashSet = new CompactHashSet();
        test("CompactHashSet", compactHashSet, compactHashSet::add);
        HashObjSet<Object> kolobokeSet = HashObjSets.newMutableSet();
        test("KolobokeSet", kolobokeSet, kolobokeSet::add);
        ObjectOpenHashSet hppcSet = new ObjectOpenHashSet();
        test("HPPC set", hppcSet, hppcSet::add);
        GuavaCompactHashSet guavaCompactHashSet = new GuavaCompactHashSet(10);
        test("GuavaCompactHashSet", guavaCompactHashSet, guavaCompactHashSet::add);
    }

    public static void test(String name, Object set, Consumer setAdd) {
        for (Long o : add) {
            setAdd.accept(o);
        }
        System.out.printf("%s: %.1f bytes per element\n", name,
                ((parseInstance(set).totalSize() - ELEMENTS_SIZE) * 1.0 / TIMES));
    }
}
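For reference: the timing numbers (Score column below) come from running this class through JMH, while the memory numbers come from the main method above, run with and without -XX:+UseCompressedOops. A minimal launcher sketch, assuming JMH's annotation processor is on the build path (the TestHashSetRunner class name is made up for this example):

package tests;

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class TestHashSetRunner {
    public static void main(String[] args) throws RunnerException {
        // Run every @Benchmark method in TestHashSet with the settings from its annotations.
        Options opt = new OptionsBuilder()
                .include(TestHashSet.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }
}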
Results:
Set implementation    Speed        Memory footprint (bytes/elem)
                                   +UseCompressedOops   -UseCompressedOops
CompactHashSet        828 ns/op     8.4                  16.8
HashSet               676 ns/op    37.4                  60.3
HPPC Set              853 ns/op    10.5                  18.9
Koloboke Set          587 ns/op     8.4                  16.8
GuavaCompactHashSet   874 ns/op    25.9                  37.4
It appears that CompactHashSet is actually slower than good old HashSet here, even though it uses far less memory.
It depends.
Are you dealing with very large sets and many insert or lookup operations? The new implementation cut the time roughly in half for a million operations. That's a great improvement, but if you're only doing a few thousand operations, or a dozen, this quickly turns into a micro-optimization.
The tests shown also insert Long values into the set. Both runtime and memory usage may change if you store something else.
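To check that for your own element type, the same JOL utility used in the benchmark above can measure the retained footprint directly. A rough sketch (the ElementTypeFootprint class name is made up; it assumes JOL is on the classpath, and the absolute numbers will vary by JVM and flags):

import org.openjdk.jol.info.GraphLayout;

import java.util.HashSet;
import java.util.Set;

public class ElementTypeFootprint {
    public static void main(String[] args) {
        Set<Object> longs = new HashSet<>();
        Set<Object> strings = new HashSet<>();
        for (long i = 0; i < 100_000; i++) {
            longs.add(new Long(i));          // distinct instances, as in the benchmark above
            strings.add(Long.toString(i));   // same logical values, different element type
        }
        // totalSize() walks the whole reachable object graph, so element size is included.
        System.out.printf("Long elements:   %d bytes%n", GraphLayout.parseInstance(longs).totalSize());
        System.out.printf("String elements: %d bytes%n", GraphLayout.parseInstance(strings).totalSize());
    }
}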
If you have a use case that provably benefits from the change in a statistically significant way, then use it.
Option 1: Don't care. If you look at the Java HashSet implementation, you discover that it simply uses a HashMap internally:
public class HashSet<E>
    extends AbstractSet<E>
    implements Set<E>, Cloneable, java.io.Serializable
{
    static final long serialVersionUID = -5024744406713321676L;

    private transient HashMap<E,Object> map;
    ....
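Slightly further down in the same OpenJDK source (abridged here), add() shows where the wasted value reference comes from:

    // From java.util.HashSet (abridged): every element becomes a key in the backing
    // HashMap, and a single shared dummy object is stored as its value.
    private static final Object PRESENT = new Object();

    public boolean add(E e) {
        return map.put(e, PRESENT) == null;  // one PRESENT reference of pure overhead per entry
    }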
That is a fast implementation; however, each set entry carries a reference to a value (the shared PRESENT object) that is not needed, hence the extra memory consumption. My first option is "don't care", since I hope that at some point someone will provide an improved HashSet in the JDK. Software engineers should always have hope and a positive attitude :)
Within normal program logic I always stick to the provided standards as much as possible and use what is available. This avoids the situation where each programmer uses their own "favorite Set implementation" or, even worse, does lengthy research into which HashSet implementation is actually best ;)
Does Oracle have an open bug ticket for the poor HashMap? I cannot find one...
Option 2: Care. If you are not in business-logic code but in some technical middleware code, then performance may matter, and there are various options. The CompactHashSet in Google Guava is one. Another nice library is High Performance Primitive Collections (HPPC), which also provides sets for every primitive type. I think you will find other things that fit your particular purpose as well. Keep in mind that not every HashSet replacement has exactly the same semantics as the original.
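As a quick illustration of the primitive-set flavour, here is a rough sketch using HPPC's LongOpenHashSet (the naming matches the HPPC version used in the benchmark above; newer HPPC releases rename these classes, so treat the exact class name as an assumption):

import com.carrotsearch.hppc.LongOpenHashSet;

public class HppcPrimitiveSetExample {
    public static void main(String[] args) {
        // Elements are stored as raw longs, so there is no Long boxing at all.
        LongOpenHashSet set = new LongOpenHashSet();
        for (long i = 0; i < 1000; i++) {
            set.add(i * 7);                    // add(long)
        }
        System.out.println(set.contains(42));  // contains(long): true, since 42 = 6 * 7
        System.out.println(set.size());        // 1000
    }
}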
So, I personally would never replace java.util.HashSet just "by default".