
How to accelerate matrix multiplications in Python?

I am developing a small neural network whose parameters need a lot of optimization, and therefore a lot of processing time. I have profiled my script with cProfile: about 80% of the processor time is spent in NumPy's dot function, and the rest is matrix inversion via numpy.linalg.solve. My current version of NumPy appears to use BLAS, since numpy.core._dotblas.dot shows up as the function taking 80% of the total processing time.
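For reference, a minimal sketch of this kind of profiling session; train() here is a hypothetical stand-in for the real optimization loop, which is not shown:

    import cProfile
    import pstats

    import numpy as np

    def train():
        # Hypothetical stand-in for the network's inner loop:
        # repeated dense matrix products of the sizes described below.
        w = np.random.rand(500, 500)
        x = np.random.rand(500, 500)
        for _ in range(1000):
            x = np.dot(w, x)
            x /= np.abs(x).max()  # keep values bounded across iterations

    cProfile.run("train()", "profile.out")
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)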

As this is the core of my neural network and I have to run it a lot, any minor speed gain would save me a great deal of time across the many repeated parameter optimizations.

More details: the matrix multiplications involve matrices ranging from 100*100 up to 500*500. I have a 12-core computer and so far use the cores to run different neural network parameter optimizations in parallel, but perhaps the matrix multiplication itself could be parallelized?

Thank you for your time!

Answer:

I spent a few days testing and installing/uninstalling libraries... Here is the result of what I tested. By default on my version of Ubuntu (12.04), with the repository-installed version of NumPy, the BLAS libraries are the ATLAS libraries. I made some tests that reflect the improvement SPECIFICALLY on the computations I am interested in, so these results must not be interpreted as the final answer. The computations involve a matrix multiplication (dot product) in a 55,000-iteration loop, with 500*500 and 1000*1000 matrices. I use an HP Z800 workstation with a Xeon X5675 @ 3.07 GHz with 12 cores. All the results (percentages) compare the described condition against the reference, which here is the packaged ATLAS library.
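A minimal sketch of the kind of timing loop behind these numbers (the exact benchmark code is not shown here, so the details are assumptions):

    import time

    import numpy as np

    n = 500  # also tested with n = 1000
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    start = time.time()
    for _ in range(55000):
        np.dot(a, b)
    print("elapsed: %.1f s" % (time.time() - start))

Running the same script with NumPy linked against ATLAS, OpenBLAS, and MKL in turn gives the relative figures below.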

  • Scipy.sparse module: I don't know if I set it up correctly, but with 10% sparsity, using this module only becomes useful starting from 1500*1500 matrices, with OpenBLAS and MKL. If you have suggestions on how to use it properly, I am interested!
  • With OpenBLAS I get a speed increase of 33% for 500*500 matrices but 160% for 1000*1000. With OpenBLAS, however, the scipy.sparse module does not perform better; it in fact performs worse.
  • The big winner here is the MKL libraries: the acceleration goes up to 230% with 1000*1000 matrices compared to the original ATLAS libraries! For 500*500 matrices the acceleration is more modest (100%) but still very good. Furthermore, when compiled with OpenMP, matrix multiplications can run on my 12 processors, and there it is twice as fast as on one processor with the MKL libraries. But that is a waste of processing power; it is much more efficient to use multiprocessing modules to run scripts/matrix multiplications in parallel (a sketch of pinning the BLAS thread count follows this list).
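For reproducing the one-core versus multi-core comparison above, the BLAS thread pool can be pinned through environment variables set before NumPy is imported; which variable is read depends on the BLAS flavor. A minimal sketch:

    import os

    # Pin the BLAS thread pool BEFORE importing NumPy.
    # MKL reads MKL_NUM_THREADS, OpenBLAS reads OPENBLAS_NUM_THREADS,
    # and OpenMP-enabled builds read OMP_NUM_THREADS.
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["MKL_NUM_THREADS"] = "1"
    os.environ["OPENBLAS_NUM_THREADS"] = "1"

    import numpy as np

    a = np.random.rand(1000, 1000)
    b = np.random.rand(1000, 1000)
    np.dot(a, b)  # now runs on a single core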
asked Sep 02 '12 by PierreE




2 Answers

If you're not already, you could try linking numpy against a heavily optimized BLAS library like Intel MKL (free-as-in-beer for non-commercial use, or discounted for academic use, which apparently doesn't count as non-commercial; Intel provides instructions for using it with numpy) or OpenBLAS (free-as-in-speech). There's also the Enthought Python Distribution, which comes pre-linked to MKL and is free-as-in-beer for academics. Either library can parallelize your matrix multiplications automatically and can be much faster than the typical reference BLAS / ATLAS installation on most Linux distros, or whatever it is you're using.
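To check which BLAS your numpy build is actually linked against before and after switching, you can inspect the build configuration (a quick sanity check, not a speedup in itself):

    import numpy as np

    # Prints the BLAS/LAPACK libraries this NumPy build was compiled
    # against (e.g. ATLAS, OpenBLAS, or MKL entries).
    np.__config__.show()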

Otherwise, the only other thing I know of would be mathematical tricks to reduce how many multiplications/solves you have to compute. Without knowing exactly what you're doing, it's hard to give any suggestions there.

I'm assuming that your matrices are dense, since they usually are in neural nets, but if you're doing something unusual, scipy.sparse might help too.
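If your matrices did turn out to be sparse, a minimal sketch of what scipy.sparse usage could look like (the sizes and 10% density are just illustrative assumptions):

    from scipy import sparse

    # Random 1500x1500 matrices with ~10% non-zero entries, in CSR format.
    a = sparse.rand(1500, 1500, density=0.10, format="csr")
    b = sparse.rand(1500, 1500, density=0.10, format="csr")

    c = a * b            # sparse matrix product; the result stays sparse
    dense = c.toarray()  # back to a dense ndarray if needed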

answered Sep 20 '22 by Danica


Numpy already uses really fast internal algorithms and representations based on third-party libraries (such as BLAS, which you mentioned) that use SSE optimizations, among others. Because the original BLAS is a tad slow (it aims to be a reference implementation, focusing on precision rather than performance), you may wish to use another flavor focused on performance, such as OpenBLAS. To use OpenBLAS, you need to either find a pre-built OpenBLAS-enabled Numpy package or recompile a version linked against OpenBLAS. Once you are using an efficient BLAS implementation, you won't find a better speedup option from Python itself, short of writing a library in C and taking a lot of time to optimize it.

On the other hand, you can check whether your Numpy and BLAS library are compiled as efficiently as possible for your architecture. For instance, if you can activate the OpenMP library during the Numpy compilation, it would allow multiple cores to work on your problem using data-level parallelism. This can be a significant source of speedup if you have multiple cores on your computer and your computations are CPU-bound. If your kind of problem allows it, you could even use a task-based parallel programming library (SCOOP [Disclaimer: I wrote it], Celery, etc.) to distribute your work across multiple computers.
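Since you already run parameter optimizations in parallel on 12 cores, here is a minimal sketch of that task-level approach using only the standard library; optimize() and its body are hypothetical stand-ins for the actual network code:

    from multiprocessing import Pool

    import numpy as np

    def optimize(seed):
        # Hypothetical stand-in for one parameter optimization run:
        # builds matrices of the sizes in question and multiplies them.
        rng = np.random.RandomState(seed)
        w = rng.rand(500, 500)
        x = rng.rand(500, 500)
        for _ in range(100):
            x = np.dot(w, x)
            x /= np.abs(x).max()  # keep values bounded
        return x.sum()

    if __name__ == "__main__":
        pool = Pool(processes=12)  # one worker per core
        results = pool.map(optimize, range(12))
        pool.close()
        pool.join()
        print(results)

If the BLAS library is itself multithreaded, it is usually worth limiting it to one thread per worker process so the 12 workers do not oversubscribe the cores.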

As a last resort, another possibility would be to buy new hardware. It makes software potentially go faster without changing a single line of code.

answered Sep 18 '22 by Soravux