Java performance in numerical algorithms

I am curious about the performance of numerical algorithms in Java, for example double-precision matrix-matrix multiplication, on the latest JIT-compiling JVMs as compared to hand-tuned SSE C++/assembler or Fortran counterparts.

I have looked on the web, but most of the results are almost 10 years old, and I understand Java has progressed quite a lot since then.

If you have experience using Java for numerically intensive applications, can you share it? Also, how well does Java perform in kernels where the loops are relatively short and the memory access is not very uniform but still fits within the L1 cache? If such a kernel is executed multiple times in succession, can the JVM optimize it at runtime?

Thanks

Anycorn asked Nov 09 '09 01:11


2 Answers

I have written some reasonably large and performance-sensitive numerical code in Java (usually crunching large arrays of doubles).

I've found Java to be "good enough" for fast numerical calculations, especially when you consider that you are usually not CPU-bound anyway: for large datasets, memory latency and cache awareness will probably be your biggest problems.
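To illustrate the cache point, here is a minimal sketch (not code from my projects; the class and method names are made up for this example) that multiplies two square matrices stored as flat row-major `double[]` arrays using two loop orders. The i-j-k order reads `b` with a stride of `n` in the inner loop, while the i-k-j order reads `b` and writes `c` with stride 1. The `main` method repeats the measurement a few times so the JIT has a chance to compile both methods; the matrix size and the run count are arbitrary assumptions, and the timings you see will depend on your JVM and hardware.

```java
import java.util.Random;

/** Hypothetical illustration class; the names are not from any library. */
final class MatMulLoopOrder {

    /** i-j-k order: the inner loop walks b down a column (stride n). */
    static double[] multiplyIjk(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
        return c;
    }

    /** i-k-j order: the inner loop walks b and c along a row (stride 1). */
    static double[] multiplyIkj(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i * n + k];
                for (int j = 0; j < n; j++) {
                    c[i * n + j] += aik * b[k * n + j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int n = 512; // arbitrary size chosen for the sketch
        Random rnd = new Random(42);
        double[] a = rnd.doubles((long) n * n).toArray();
        double[] b = rnd.doubles((long) n * n).toArray();
        // Repeat the measurement: early runs include interpretation and JIT compilation.
        for (int run = 0; run < 5; run++) {
            long t0 = System.nanoTime();
            multiplyIjk(a, b, n);
            long t1 = System.nanoTime();
            multiplyIkj(a, b, n);
            long t2 = System.nanoTime();
            System.out.printf("run %d: ijk %d ms, ikj %d ms%n",
                    run, (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        }
    }
}
```

Both methods compute the same product; only the memory access pattern differs. This is a crude timing loop, not a proper benchmark, so treat the numbers as a rough indication only.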

However, you can still beat Java with hand-optimized C/C++ code that takes advantage of specific vectorised instructions or highly customised memory layouts. So for the very fastest code, you could consider writing the core algorithm in C/C++ and calling it from Java using JNI.

Personally, I find that creating a native code dependency is usually more trouble than it is worth, so I tend to stick to the pure Java approach.
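For what the JNI route looks like on the Java side, here is a rough sketch under stated assumptions: the library name `fastdot`, the class `NativeDot` and the method `dotNative` are all hypothetical, and the C/C++ implementation is not shown. The class tries to load the native library and falls back to a pure Java loop if it is not present, so it still runs without any native code.

```java
/** Hypothetical wrapper around a native dot-product kernel. */
final class NativeDot {

    // True only if the (hypothetical) native library could be loaded.
    private static final boolean NATIVE_AVAILABLE = tryLoad();

    private static boolean tryLoad() {
        try {
            // Looks for libfastdot.so / fastdot.dll on java.library.path (assumed name).
            System.loadLibrary("fastdot");
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            return false;
        }
    }

    // Would be implemented in C/C++ as:
    // JNIEXPORT jdouble JNICALL Java_NativeDot_dotNative(
    //     JNIEnv*, jclass, jdoubleArray, jdoubleArray, jint);
    private static native double dotNative(double[] x, double[] y, int n);

    /** Pure Java fallback. */
    static double dotJava(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    /** Uses the native kernel when it was loaded, otherwise the Java loop. */
    static double dot(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        return NATIVE_AVAILABLE ? dotNative(x, y, x.length) : dotJava(x, y);
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {4.0, 5.0, 6.0};
        System.out.println("native available: " + NATIVE_AVAILABLE);
        System.out.println("dot = " + dot(x, y));
    }
}
```

Keeping a pure Java fallback behind the same method is one way to limit the pain of the native dependency, at the cost of maintaining two implementations.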

mikera answered Sep 22 '22 02:09


This is coming from the .NET side of things, but I'm 90% sure that it's the case for Java too. While the JIT will make some use of SSE instructions where it can, it currently does not auto-vectorize your code when dealing with, for example, matrix multiplication. Hand-vectorized C++ using compiler intrinsics or inline assembly will definitely be faster here.

JulianR answered Sep 20 '22 02:09