I have a program that is making a huge number of calls to Long.bitCount(), so many that it is taking 33% of cycles on one CPU core. Is there a way to implement it that is faster than the Sun JDK version?
I have tried:
But I couldn't do any better than a 2^16-entry lookup table with a manually unrolled loop (about 27% CPU).
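For reference, a version of the lookup-table approach described above might look like the following; the table initialization and the exact unrolling are assumptions based on the description, not the asker's actual code:

```java
// Sketch of a 2^16-entry lookup table for counting bits in a long.
// The four 16-bit chunks of the value are looked up in a manually
// unrolled sequence instead of a loop.
public final class PopCount {
    private static final byte[] TABLE = new byte[1 << 16];

    static {
        // Entry i holds the number of set bits in the 16-bit value i:
        // it equals the count for i >>> 1 plus the lowest bit of i.
        for (int i = 1; i < TABLE.length; i++) {
            TABLE[i] = (byte) (TABLE[i >>> 1] + (i & 1));
        }
    }

    public static int bitCount(long x) {
        return TABLE[(int) (x & 0xFFFF)]
             + TABLE[(int) ((x >>> 16) & 0xFFFF)]
             + TABLE[(int) ((x >>> 32) & 0xFFFF)]
             + TABLE[(int) (x >>> 48)];
    }
}
```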
How else might this be optimized for Java?
Note: this question is about Java-specific optimization, but this similar (language-agnostic) question has many other algorithms.
If you are on a recent x86 CPU, there is an instruction for this: popcnt.
In recent versions of Java, Long.bitCount() compiles down to this instruction. Just use -XX:+UsePopCountInstruction (it is on by default in recent releases).
However, there are some bugs with it in JRE 6.0_u18 through 7.0_u5: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7063674
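When the popcnt intrinsic is not available, the JVM falls back to the pure-Java implementation, which is the classic branch-free bit-twiddling routine (this mirrors the published `Long.bitCount` source in OpenJDK):

```java
public final class SwarPopCount {
    // Counts set bits without branches or tables: pairs of bits are
    // summed, then nibbles, then bytes, and the byte sums are folded
    // together so the total ends up in the low byte.
    public static int bitCount(long i) {
        i = i - ((i >>> 1) & 0x5555555555555555L);
        i = (i & 0x3333333333333333L) + ((i >>> 2) & 0x3333333333333333L);
        i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
        i = i + (i >>> 8);
        i = i + (i >>> 16);
        i = i + (i >>> 32);
        return (int) i & 0x7f;
    }
}
```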
This seems like the kind of problem a GPU would excel at, and offloading it could plausibly cut your time by a couple of orders of magnitude.
Otherwise, I think you may have to deal with it at a higher level: multiple threads working on different segments of the data at once (which I'm sure you already do), processing the data while you are still collecting it, or spreading the work across multiple machines.
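As a sketch of the thread-level option, a parallel stream lets the common fork/join pool split the array into segments that are counted concurrently (the class and method names here are illustrative):

```java
import java.util.Arrays;

public final class ParallelPopCount {
    // Sums the set bits of all elements. parallel() splits the array
    // into segments handled by the common fork/join pool; each worker
    // counts its own chunk and the partial sums are combined.
    public static long totalBits(long[] data) {
        return Arrays.stream(data)
                     .parallel()
                     .map(Long::bitCount)
                     .sum();
    }
}
```

For small arrays the fork/join overhead will dominate, so this only pays off once each segment is large enough to amortize the scheduling cost.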
If your machine has a SIMD integer unit that can process data wider than 64 bits (such as SSE2 or VMX), you can compute the bit counts of several 64-bit elements at once.
Unfortunately, this will require you to provide machine-specific implementations in a lower-level language than Java.
I suspect that your app is memory-bound rather than CPU-bound, i.e. it spends more time fetching the values from memory than counting their bits. In that case you should try to reduce the size of the working set or improve access locality to reduce cache misses (if the algorithm allows it).
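If the values are only being read in order to count their bits, one locality-friendly option is to fold the counting into the pass that already produces or touches the data, instead of storing the values and rescanning them later. A hypothetical sketch of such a sink:

```java
public final class StreamingBitCounter {
    private long total;

    // Hypothetical sink: count bits as each value arrives, so it is
    // consumed while still in a register or L1 cache, avoiding a
    // second memory pass over a stored array.
    public void accept(long value) {
        total += Long.bitCount(value);
    }

    public long total() {
        return total;
    }
}
```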