
Java JDK BitSet vs Lucene OpenBitSet

I was trying to implement a Bloom filter and came across some discussions of BitSets. Lucene's OpenBitSet claims to be faster than the Java BitSet implementation in almost all operations.

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.10.4/org/apache/lucene/util/OpenBitSet.java#OpenBitSet

I tried to look at the code for both the implementations.

Java BitSet code

http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/BitSet.java#BitSet

It seems to me that both these classes use an array of long to store the bits: each bit maps to an array index and a bit position within the long value stored at that index.
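That shared addressing scheme can be sketched like this (a minimal illustration of the idea, not the actual code of either library):

```java
// Minimal sketch of the long[]-backed bit addressing that both
// java.util.BitSet and OpenBitSet use internally (illustrative only).
public class LongArrayBits {
    private final long[] words;

    public LongArrayBits(int numBits) {
        // 64 bits per long, so divide by 64 (>>> 6), rounding up
        this.words = new long[(numBits + 63) >>> 6];
    }

    public void set(int bitIndex) {
        // word index = bitIndex / 64; bit within word = bitIndex % 64
        words[bitIndex >>> 6] |= 1L << (bitIndex & 63);
    }

    public boolean get(int bitIndex) {
        return (words[bitIndex >>> 6] & (1L << (bitIndex & 63))) != 0;
    }
}
```

(In Java the `& 63` is technically redundant, since shifts on long already mask the shift count to its low 6 bits, which both implementations rely on.)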

What is the reason, then, that the OpenBitSet implementation is supposedly so much faster? Where is the difference in the code that leads to this improvement in speed?

Asked May 06 '16 by sriram manoj



1 Answer

OK, here's how to approach claims like this.

When someone claims that their implementation is 2-3x faster, backs it with common phrases such as "maximum code reuse" and "no extra safety", and doesn't provide any real benchmark, you should raise a red flag. Indeed, all the benchmarks in their mailing lists/docs come with no source code and were (judging by the results) written by hand, probably violating benchmarking rules, instead of using JMH.

Before hand-waving about why something is faster than something else, let's write a benchmark and see whether it's really faster before making any statements. The benchmark code is here: it tests all basic operations for sets of size 1024 and 1024 * 1024 (~1M) with a fill factor of 50%. Tests were run on an Intel Core i7-4870HQ CPU @ 2.50GHz. Score is throughput: the higher, the better.

The benchmark methods all look like this (the get variants are shown; the rest follow the same pattern):

  @Benchmark
  public boolean getClassic(BitSetState state) {
      return state.bitSet.get(state.nextIndex);
  }

  @Benchmark
  public boolean getOpen(BitSetState state) {
      return state.openBitSet.get(state.nextIndex);
  }

  @Benchmark
  public boolean getOpenFast(BitSetState state) {
      return state.openBitSet.fastGet(state.nextIndex);
  }

Ok, let's see the results:

Benchmark                           (setSize)   Mode  Cnt    Score    Error   Units
BitSetBenchmark.andClassic               1024  thrpt    5  109.541 ± 46.361  ops/us
BitSetBenchmark.andOpen                  1024  thrpt    5  111.039 ±  9.648  ops/us
BitSetBenchmark.cardinalityClassic       1024  thrpt    5   93.509 ± 10.943  ops/us
BitSetBenchmark.cardinalityOpen          1024  thrpt    5   29.216 ±  4.824  ops/us
BitSetBenchmark.getClassic               1024  thrpt    5  291.944 ± 46.907  ops/us
BitSetBenchmark.getOpen                  1024  thrpt    5  245.023 ± 75.144  ops/us
BitSetBenchmark.getOpenFast              1024  thrpt    5  228.563 ± 91.933  ops/us
BitSetBenchmark.orClassic                1024  thrpt    5  121.070 ± 12.220  ops/us
BitSetBenchmark.orOpen                   1024  thrpt    5  107.612 ± 16.579  ops/us
BitSetBenchmark.setClassic               1024  thrpt    5  527.291 ± 26.895  ops/us
BitSetBenchmark.setNextClassic           1024  thrpt    5  592.465 ± 34.926  ops/us
BitSetBenchmark.setNextOpen              1024  thrpt    5  575.186 ± 33.459  ops/us
BitSetBenchmark.setOpen                  1024  thrpt    5  527.568 ± 46.240  ops/us
BitSetBenchmark.setOpenFast              1024  thrpt    5  522.131 ± 54.856  ops/us



Benchmark                           (setSize)   Mode  Cnt    Score    Error   Units
BitSetBenchmark.andClassic            1232896  thrpt    5    0.111 ±  0.009  ops/us
BitSetBenchmark.andOpen               1232896  thrpt    5    0.131 ±  0.010  ops/us
BitSetBenchmark.cardinalityClassic    1232896  thrpt    5    0.174 ±  0.012  ops/us
BitSetBenchmark.cardinalityOpen       1232896  thrpt    5    0.049 ±  0.004  ops/us
BitSetBenchmark.getClassic            1232896  thrpt    5  298.027 ± 40.317  ops/us
BitSetBenchmark.getOpen               1232896  thrpt    5  243.472 ± 87.491  ops/us
BitSetBenchmark.getOpenFast           1232896  thrpt    5  248.743 ± 79.071  ops/us
BitSetBenchmark.orClassic             1232896  thrpt    5    0.135 ±  0.017  ops/us
BitSetBenchmark.orOpen                1232896  thrpt    5    0.131 ±  0.021  ops/us
BitSetBenchmark.setClassic            1232896  thrpt    5  525.137 ± 11.849  ops/us
BitSetBenchmark.setNextClassic        1232896  thrpt    5  597.890 ± 51.158  ops/us
BitSetBenchmark.setNextOpen           1232896  thrpt    5  485.154 ± 63.016  ops/us
BitSetBenchmark.setOpen               1232896  thrpt    5  524.989 ± 27.977  ops/us
BitSetBenchmark.setOpenFast           1232896  thrpt    5  532.943 ± 74.671  ops/us

Surprising, isn't it? What can we learn from results?

  • Get and set (including the fast versions) are equal in terms of performance. Their results lie within the same error bounds, and it's hard to tell any difference without proper nanobenchmarking, so for typical application use the choice of implementation makes no difference and the one extra if branch doesn't matter. The statement about OpenBitSet's better get/set performance is therefore false. UPD: a nanobenchmark of the get methods doesn't show any difference either; results are here.
  • BitSet's cardinality can be calculated much faster (~3x for both the 1K and 1M sizes), so the statement about "ultra fast cardinality" is false. But numbers are meaningless without an answer to why the performance differs, so let's dig a bit. To count bits in its words, BitSet uses Long#bitCount, which is a HotSpot intrinsic: the whole bitCount method is compiled into a single instruction (for the curious, on x86 it's popcnt). OpenBitSet, by contrast, uses hand-rolled bit counting with tricks from Hacker's Delight (see org.apache.lucene.util.BitUtil#pop_array). No wonder the classic version is faster now.
  • Group set methods like and/or perform the same, so no performance win here. But here's an interesting detail: the BitSet implementation tracks the maximum index of a word in which at least one bit is set, and performs and/or/cardinality operations only within the bounds [0, maxIndex]. So we can compare specific cases where only the first 1/10/50% of the set has bits set and the rest is empty (with the same 50% fill factor within that part). Then BitSet's performance should differ, while OpenBitSet's should stay the same. Let's validate (benchmark code):

    Benchmark                   (fillFactor)  (setSize)   Mode  Cnt   Score    Error   Units
    BitSetBenchmark.andClassic          0.01    1232896  thrpt    5  32.036 ±  1.320  ops/us
    BitSetBenchmark.andClassic           0.1    1232896  thrpt    5   3.824 ±  0.896  ops/us
    BitSetBenchmark.andClassic           0.5    1232896  thrpt    5   0.330 ±  0.027  ops/us
    BitSetBenchmark.andClassic             1    1232896  thrpt    5   0.140 ±  0.017  ops/us
    BitSetBenchmark.andOpen             0.01    1232896  thrpt    5   0.142 ±  0.008  ops/us
    BitSetBenchmark.andOpen              0.1    1232896  thrpt    5   0.128 ±  0.015  ops/us
    BitSetBenchmark.andOpen              0.5    1232896  thrpt    5   0.112 ±  0.015  ops/us
    BitSetBenchmark.andOpen                1    1232896  thrpt    5   0.132 ±  0.018  ops/us
    BitSetBenchmark.orClassic           0.01    1232896  thrpt    5  27.826 ± 13.312  ops/us
    BitSetBenchmark.orClassic            0.1    1232896  thrpt    5   3.727 ±  1.161  ops/us
    BitSetBenchmark.orClassic            0.5    1232896  thrpt    5   0.342 ±  0.022  ops/us
    BitSetBenchmark.orClassic              1    1232896  thrpt    5   0.133 ±  0.021  ops/us
    BitSetBenchmark.orOpen              0.01    1232896  thrpt    5   0.133 ±  0.009  ops/us
    BitSetBenchmark.orOpen               0.1    1232896  thrpt    5   0.118 ±  0.007  ops/us
    BitSetBenchmark.orOpen               0.5    1232896  thrpt    5   0.127 ±  0.018  ops/us
    BitSetBenchmark.orOpen                 1    1232896  thrpt    5   0.148 ±  0.023  ops/us
    

The smaller the filled portion of the set, the faster BitSet is; when bits are distributed uniformly, the performance of BitSet and OpenBitSet becomes equal, confirming the theory. So for specific non-uniform distributions of set bits the classic BitSet is faster for group operations, and the statement about very fast group operations in OpenBitSet is false.
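To make the cardinality point above concrete: Long.bitCount is the intrinsic the JDK's BitSet relies on, while a SWAR routine in the Hacker's Delight style, like the sketch below, is the kind of hand-rolled per-word counting OpenBitSet's BitUtil does (this is a simplified illustration, not Lucene's exact code):

```java
public class PopCount {
    // Hand-rolled SWAR popcount in the Hacker's Delight style,
    // similar in spirit to OpenBitSet's BitUtil (simplified sketch).
    static int popHackersDelight(long x) {
        // fold pairs of bits into 2-bit counts
        x = x - ((x >>> 1) & 0x5555555555555555L);
        // fold 2-bit counts into 4-bit counts
        x = (x & 0x3333333333333333L) + ((x >>> 2) & 0x3333333333333333L);
        // fold 4-bit counts into byte counts
        x = (x + (x >>> 4)) & 0x0F0F0F0F0F0F0F0FL;
        // sum all bytes into the top byte and shift it down
        return (int) ((x * 0x0101010101010101L) >>> 56);
    }
}
```

Both approaches return identical counts; the difference is that HotSpot compiles Long.bitCount down to a single hardware instruction, while the hand-rolled version always executes the full sequence of arithmetic ops.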


Summary

This answer and these benchmarks aren't meant to show that OpenBitSet is bad or that its authors are liars. Indeed, given their benchmark machines (AMD Opteron and Pentium 4) and Java version (1.5), it's easy to believe that BitSet was less optimized back then, the HotSpot compiler wasn't very smart, the popcnt instruction didn't exist, and OpenBitSet was a good idea that really was more performant. Moreover, BitSet doesn't expose its internal word array, so it's impossible to build a custom fine-grained synchronized bitset or flexible serialization on top of it, and that's exactly what Lucene needed. So for Lucene it's still a reasonable choice, while for typical users it's better to use the standard BitSet, which is faster (in some cases, not in general) and belongs to the standard library. Times change and old performance results change with them, so always benchmark and validate your specific cases: maybe for some of them (e.g. the iterator, which wasn't benchmarked here, or a different fill factor) OpenBitSet will be faster.
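The point about BitSet not exposing its internal word array can be seen directly: toLongArray() (available since Java 7) hands back a new array rather than the backing words, so there is no zero-copy access for custom synchronization or serialization. Mutating the returned array leaves the BitSet untouched:

```java
import java.util.BitSet;

public class BitSetEncapsulation {
    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(3);

        // toLongArray() returns a copy, not the internal word array
        long[] words = bits.toLongArray();
        words[0] = 0L; // clobber the copy only

        // the BitSet itself is unaffected
        System.out.println(bits.get(3)); // prints "true"
    }
}
```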

Answered Sep 22 '22 by qwwdfsad