I am developing a small neural network whose parameters need a lot of optimization, and therefore a lot of processing time. I have profiled my script with cProfile: about 80% of the processor time is spent in NumPy's dot function, and the rest is matrix inversion with numpy.linalg.solve.
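For reference, a minimal sketch of how such a profile can be collected with cProfile (run_optimization is a hypothetical stand-in for the real entry point):

    # Shell equivalent: python -m cProfile -s cumulative my_script.py
    import cProfile
    import pstats

    import numpy as np

    def run_optimization():
        # Hypothetical stand-in for the real training loop.
        a = np.random.rand(500, 500)
        b = np.random.rand(500, 500)
        for _ in range(100):
            a.dot(b)

    cProfile.run('run_optimization()', 'profile.out')
    pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)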
My current version of NumPy seems to use BLAS, since numpy.core._dotblas.dot appears as the function taking that 80% of the total processing time.
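A quick way to check which BLAS NumPy was built against (the output format varies between NumPy versions):

    import numpy as np

    # Prints the BLAS/LAPACK libraries NumPy was linked against
    # (e.g. ATLAS, OpenBLAS or MKL).
    np.show_config()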
As this is the core of my neural network and I have to run it a lot, any minor speed gain could save me a lot of time over the numerous repeated parameter optimizations.
More details: the matrix multiplication involves matrices ranging from 100*100 up to 500*500. I have a 12-core computer and use the cores so far to run different neural network parameter optimizations in parallel, but maybe the matrix multiplication itself could be done in parallel?
Thank you for your time!
Answer:
I spent a few days testing, installing and uninstalling libraries... Here is the result of what I tested. By default, on my version of Ubuntu (12.04) with the repository-installed version of NumPy, the BLAS library is ATLAS. I made some tests that reflect the improvement SPECIFICALLY on the computations I am interested in, so these results must not be interpreted as a final answer. The computations involve a matrix multiplication (dot product) in a 55000-iteration loop, with 500*500 and 1000*1000 matrices. I use an HP Z800 workstation with a Xeon X5675 @ 3.07 GHz and 12 cores. All results (percentages) are comparisons between the described condition and the reference, which here is the packaged ATLAS library.
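A minimal sketch of the kind of timing loop behind these numbers (sizes and iteration count are reduced here so it finishes quickly; the real benchmark used 55000 iterations):

    import time

    import numpy as np

    n = 500          # also tested with n = 1000
    iters = 1000     # the full benchmark used 55000 iterations
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    start = time.time()
    for _ in range(iters):
        c = a.dot(b)
    print("%dx%d, %d iterations: %.2f s" % (n, n, iters, time.time() - start))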
Scipy.sparse module: I don't know if I set it up correctly, but with 10% sparsity, using this module only becomes useful starting from 1500*1500 matrices with OpenBLAS and MKL. If you have suggestions about how to use it properly, I am interested! A minimal sketch of what I tried is shown below.
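For reference, here is roughly what I tried, assuming scipy.sparse.rand with density=0.1 matches what I mean by 10% sparseness:

    import scipy.sparse

    n = 1500
    # Random sparse matrices with 10% non-zero entries, in CSR format.
    a = scipy.sparse.rand(n, n, density=0.1, format='csr')
    b = scipy.sparse.rand(n, n, density=0.1, format='csr')

    c = a.dot(b)         # sparse-sparse product, result stays sparse
    dense = c.toarray()  # convert back to a dense ndarray if needed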
If you're not already, you could try linking numpy to a very optimized BLAS library like Intel MKL (which is free-as-in-beer for non-commercial use or discounted for academic use, which apparently doesn't count as non-commercial; instructions from Intel for using it with numpy) or OpenBLAS (free-as-in-speech). There's also the Enthought Python Distribution, which is pre-linked to MKL and free-as-in-beer for academics. That can parallelize your matrix multiplications automatically and can be much faster than the typical reference BLAS / ATLAS installation on most Linux distros, or whatever it is you're using.
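If you do end up with a threaded BLAS, the number of threads it uses can usually be controlled through environment variables set before NumPy is imported (which variable is honored depends on the BLAS implementation):

    import os

    # Must be set before numpy is imported to take effect.
    os.environ['OMP_NUM_THREADS'] = '12'       # OpenMP-based builds
    os.environ['OPENBLAS_NUM_THREADS'] = '12'  # OpenBLAS
    os.environ['MKL_NUM_THREADS'] = '12'       # Intel MKL

    import numpy as np  # import only after the variables are set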
Otherwise, the only thing I know of that you could do would be some mathematical tricks to not have to compute as many multiplications / solves. Without knowing exactly what you're doing it's hard to give any suggestions there.
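One common example of such a trick, in case it applies to your problem: if you call numpy.linalg.solve repeatedly with the same matrix but different right-hand sides, you can factor the matrix once and reuse the factorization. A sketch using scipy.linalg.lu_factor and lu_solve:

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    a = np.random.rand(500, 500)

    # Factor once (O(n^3)); each subsequent solve is only O(n^2).
    lu, piv = lu_factor(a)
    for _ in range(100):
        b = np.random.rand(500)     # a new right-hand side each time
        x = lu_solve((lu, piv), b)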
I'm assuming that your matrices are dense, since they usually are in neural nets, but if you're doing something unusual, scipy.sparse might help too.
Numpy uses really fast internal algorithms and representations based on third-party libraries (such as BLAS, as you named it) that already use SSE optimizations, among others. Because the reference BLAS is a tad slow (it aims to be a reference implementation, focusing on precision rather than performance), you may wish to use another flavor focused on performance, such as OpenBLAS. To use OpenBLAS, you need to either find a pre-built OpenBLAS-enabled NumPy package or recompile a version linked against OpenBLAS. Once you are using an efficient BLAS implementation, you won't find a better speedup option in pure Python, unless you write a library in C and spend much time optimizing it.
On the other hand, you can check whether your NumPy and BLAS library are compiled as efficiently as possible for your architecture. For instance, if you can activate the OpenMP library when compiling NumPy, it would allow multiple cores to work on your problem using data-level parallelism. This can be a significant source of speedup if you have multiple cores and your computations are CPU-bound. If your kind of problem allows it, you could even use a task-based parallel programming library (SCOOP [Disclaimer: I wrote it], Celery, etc.) to distribute your work across multiple computers.
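For task-level parallelism, even the standard library is enough to fan independent parameter optimizations out over your 12 cores; a minimal sketch with multiprocessing, where evaluate_params is a hypothetical stand-in for one optimization run:

    from multiprocessing import Pool

    def evaluate_params(params):
        # Hypothetical stand-in: train and score one network configuration.
        learning_rate, hidden_units = params
        return learning_rate * hidden_units

    if __name__ == '__main__':
        param_sets = [(0.1, 100), (0.2, 200), (0.3, 300)]
        with Pool(processes=12) as pool:
            results = pool.map(evaluate_params, param_sets)
        print(results)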
As a last resort, another possibility is to buy new hardware: it can make the software go faster without changing a single line of code.