Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimizing Long.bitCount

I have a program that is making a huge number of calls to Long.bitCount(), so many that it is taking 33% of cycles on one CPU core. Is there a way to implement it that is faster than the Sun JDK version?

I have tried:

  • This algorithm (I think this is exactly how the JDK implements it)
  • lookup tables of various sizes between 28 and 222 (looking at a few bits at a time and adding the results)

But I couldn't do any better than a 216-entry lookup table with a manually-unrolled loop (about 27% CPU.)
How else might this be optimized for Java?


Note: this question is about Java-specific optimization, but this similar (language-agnostic) question has many other algorithms.

like image 606
finnw Avatar asked Jan 29 '11 20:01

finnw


4 Answers

If you are on a recent x86 CPU there is an instruction for this, popcnt.

In recent versions of Java, Long.bitCount() uses this instruction. Just use -XX:+UsePopCountInstruction (this is the default in recent versions)

However, there are some bugs with it in JRE 6.0_u18 through 7.0_u5: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7063674

like image 177
Scott Carey Avatar answered Nov 14 '22 16:11

Scott Carey


This seems like one of those problems that is simply perfect for the GPU to work on. It should be able to slash your time by a couple orders of magnitude.

Otherwise I think you may have to deal with it at a higher level. Having multiple threads working on different segments of data at a time (which I'm sure you already do), processing the data while you are collecting it, sharing the work around multiple systems--something like that.

like image 43
Bill K Avatar answered Nov 14 '22 16:11

Bill K


If you machine has an integer ALU that can process data wider than some multiples of 64 bits (also known as SIMD, such as SSE2 or VMX), you can compute the bit counts on several 64-bit elements at once.

Unfortunately, this will require you to provide machine-specific implementations in a lower-level language than Java.

like image 4
rwong Avatar answered Nov 14 '22 14:11

rwong


I suspect that your app is memory-bound rather than CPU-bound, i.e. it spends more time fetching the values from memory than counting their bits. In that case you should try to reduce the size of the working set or improve access locality to reduce cache misses (if the algorithm allows it).

like image 2
Michael Borgwardt Avatar answered Nov 14 '22 14:11

Michael Borgwardt