Performance of Java matrix math libraries? [closed]

I'm the author of Java Matrix Benchmark (JMatBench) and I'll give my thoughts on this discussion.

There are significant differences between Java libraries. While there is no clear winner across the whole range of operations, there are a few clear leaders, as can be seen in the latest performance results (October 2013).

If you are working with "large" matrices and can use native libraries, then the clear winner (about 3.5x faster) is MTJ with system-optimised netlib. If you need a pure Java solution, then MTJ, OjAlgo, EJML and Parallel Colt are good choices. For small matrices, EJML is the clear winner.
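
For illustration, here is a minimal sketch of a timed dense multiply in MTJ (the class names are from MTJ's no.uib.cipr.matrix package; whether the multiply runs on optimised netlib or the pure-Java fallback depends on how netlib-java is set up on your machine, and the size and harness here are assumptions, not part of the benchmark):

import java.util.Random;

import no.uib.cipr.matrix.DenseMatrix;

public class MtjMultiply {
    public static void main(String[] args) {
        int n = 2000; // illustrative size, not one of the benchmark sizes
        Random rnd = new Random();
        DenseMatrix a = new DenseMatrix(n, n);
        DenseMatrix b = new DenseMatrix(n, n);
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                a.set(i, j, rnd.nextDouble());
                b.set(i, j, rnd.nextDouble());
            }
        }
        DenseMatrix c = new DenseMatrix(n, n);

        long start = System.nanoTime();
        a.mult(b, c); // C = A * B
        System.out.printf("multiply took %.2f s%n", (System.nanoTime() - start) / 1e9);
    }
}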

The libraries I did not mention showed significant performance issues or were missing key features.


Just to add my 2 cents: I've compared some of these libraries by multiplying a 3000x3000 matrix of doubles by itself. The results are as follows.

Using multithreaded ATLAS with C/C++, Octave, Python and R, the time taken was around 4 seconds.

Using Jama with Java, the time taken was 50 seconds.

Using Colt and Parallel Colt with Java, the time taken was 150 seconds!

Using JBLAS with Java, the time taken was again around 4 seconds as JBLAS uses multithreaded ATLAS.

So for me it was clear that the Java libraries didn't perform too well. However, if you have to code in Java, then the best option is JBLAS. Jama, Colt and Parallel Colt are not fast.
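
For reference, a minimal sketch of this kind of test with jblas (DoubleMatrix.rand and mmul are from the jblas API; the timing harness here is just an illustration, not the exact code used above):

import org.jblas.DoubleMatrix;

public class MultiplyBench {
    public static void main(String[] args) {
        // 3000 x 3000 matrix of uniform random doubles
        DoubleMatrix a = DoubleMatrix.rand(3000, 3000);

        long start = System.nanoTime();
        DoubleMatrix c = a.mmul(a); // A * A, delegated to the native ATLAS backend
        System.out.printf("A * A took %.2f s%n", (System.nanoTime() - start) / 1e9);
    }
}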


I'm the main author of jblas and wanted to point out that I released version 1.0 in late December 2009. I worked a lot on the packaging, meaning that you can now just download a "fat jar" with the ATLAS and JNI libraries included for Windows, Linux and Mac OS X, in 32-bit and 64-bit builds (except 64 bit on Windows). This way you get native performance just by adding the jar file to your classpath. Check it out at http://jblas.org!


I just compared Apache Commons Math with jlapack.

Test: singular value decomposition of a random 1024x1024 matrix.

Machine: Intel(R) Core(TM)2 Duo CPU E6750 @ 2.66GHz, linux x64

Octave code: A=rand(1024); tic;[U,S,V]=svd(A);toc

results                                execution time
---------------------------------------------------------
Octave                                 36.34 sec

JDK 1.7u2 64bit
    jlapack dgesvd                     37.78 sec
    apache commons math SVD            42.24 sec


JDK 1.6u30 64bit
    jlapack dgesvd                     48.68 sec
    apache commons math SVD            50.59 sec

Native routines
Lapack* invoked from C:                37.64 sec
Intel MKL                               6.89 sec(!)

My conclusion is that jlapack called from JDK 1.7 is very close to the native binary performance of lapack. I used the lapack binary library that comes with the Linux distro and invoked the dgesvd routine to get the U, S and VT matrices as well. All tests were done in double precision on exactly the same matrix each run (except Octave).

Disclaimer: I'm not an expert in linear algebra, I'm not affiliated with any of the libraries above, and this is not a rigorous benchmark. It's a 'home-made' test, as I was interested in comparing the performance increase of JDK 1.7 over 1.6, as well as Commons Math SVD to jlapack.
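
For anyone who wants to reproduce the Commons Math side, a minimal sketch of such a timing run (the class names are from the commons-math3 API, which may differ from the version benchmarked above; the harness is an assumption, not the original code):

import java.util.Random;

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

public class SvdBench {
    public static void main(String[] args) {
        int n = 1024;
        Random rnd = new Random();
        double[][] data = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                data[i][j] = rnd.nextDouble();

        RealMatrix a = new Array2DRowRealMatrix(data, false); // wrap without copying

        long start = System.nanoTime();
        SingularValueDecomposition svd = new SingularValueDecomposition(a);
        // Touch U, S and VT so the full decomposition is materialised
        svd.getU();
        svd.getS();
        svd.getVT();
        System.out.printf("SVD took %.2f s%n", (System.nanoTime() - start) / 1e9);
    }
}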


I can't really comment on specific libraries, but in principle there's little reason for such operations to be slower in Java. Hotspot generally does the kinds of things you'd expect of a compiler: it compiles basic math operations on Java variables to the corresponding machine instructions (it uses SSE instructions, but only one operation per instruction); accesses to elements of an array are compiled to use "raw" MOV instructions as you'd expect; it makes decisions on how to allocate variables to registers when it can; and it re-orders instructions to take advantage of the processor architecture.

A possible exception is that, as mentioned, Hotspot will only perform one operation per SSE instruction; in principle you could have a fantastically optimised matrix library that performed multiple operations per instruction, although I don't know whether, say, your particular FORTRAN library does so, or whether such a library even exists. If it does, there's currently no way for Java (or at least Hotspot) to compete with that (though you could of course write your own native library with those optimisations to call from Java).

So what does all this mean? Well:

  • in principle, it is worth hunting around for a better-performing library, though unfortunately I can't recommend one
  • if performance is really critical to you, I would consider just coding your own matrix operations, because you may then be able to perform certain optimisations that a library generally can't, or that a particular library you're using doesn't (if you have a multiprocessor machine, find out if the library is actually multithreaded)

A hindrance to matrix operations is often the data locality issue that arises when you need to traverse both row by row and column by column, e.g. in matrix multiplication, since you have to store the data in an order that optimises one or the other. But if you hand-write the code, you can sometimes combine operations to optimise data locality (e.g. if you're multiplying a matrix by its transpose, you can turn a column traversal into a row traversal if you write a dedicated function instead of combining two library functions). As usual in life, a library will give you non-optimal performance in exchange for faster development; you need to decide just how important performance is to you.
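
As a concrete example of that last point, a hand-written C = A * A^T can read both operands row by row, which a generic multiply(A, transpose(A)) can't do. A minimal sketch (plain double[][] arrays, no particular library assumed, and no blocking or threading):

/** Computes C = A * A^T while traversing A only row by row. */
static double[][] multiplyByTranspose(double[][] a) {
    int rows = a.length, cols = a[0].length;
    double[][] c = new double[rows][rows];
    for (int i = 0; i < rows; i++) {
        double[] ri = a[i];
        for (int j = 0; j < rows; j++) {
            double[] rj = a[j]; // row j of A doubles as column j of A^T
            double sum = 0.0;
            for (int k = 0; k < cols; k++) {
                sum += ri[k] * rj[k]; // both reads are sequential in memory
            }
            c[i][j] = sum;
        }
    }
    return c;
}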


Jeigen https://github.com/hughperkins/jeigen

  • wraps Eigen C++ library http://eigen.tuxfamily.org , which is one of the fastest free C++ libraries available
  • relatively terse syntax, eg 'mmul', 'sub'
  • handles both dense and sparse matrices

A quick test, multiplying two dense matrices:

import static jeigen.MatrixUtil.*;

import jeigen.DenseMatrix;

public class JeigenBench {
    public static void main(String[] args) {
        int K = 100;
        int N = 100000;
        DenseMatrix A = rand(N, K);  // N x K matrix of uniform random values
        DenseMatrix B = rand(K, N);  // K x N
        Timer timer = new Timer();   // Jeigen's timing helper
        DenseMatrix C = B.mmul(A);   // (K x N) * (N x K) -> K x K product
        timer.printTimeCheckMilliseconds();
    }
}

Results:

Jama: 4090 ms
Jblas: 1594 ms
Ojalgo: 2381 ms (using two threads)
Jeigen: 2514 ms
  • Compared to jama, everything is faster :-P
  • Compared to jblas, Jeigen is not quite as fast, but it handles sparse matrices.
  • Compared to ojalgo, Jeigen takes about the same elapsed time, but uses only one core, so Jeigen uses about half the total CPU. Jeigen also has a terser syntax, ie 'mmul' versus 'multiplyRight'