I am developing a small neural network whose parameters need a lot of optimization, and therefore a lot of processing time. I have profiled my script with cProfile: about 80% of the processor time is spent in NumPy's dot function, and the rest is matrix inversion with numpy.linalg.solve.
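For reference, a minimal sketch of how such a profile can be collected with cProfile (run_optimization is a hypothetical stand-in for the real entry point):

    # Shell equivalent: python -m cProfile -s cumulative my_script.py
    import cProfile
    import pstats

    import numpy as np

    def run_optimization():
        # Hypothetical stand-in for the real training loop.
        a = np.random.rand(500, 500)
        b = np.random.rand(500, 500)
        for _ in range(100):
            a.dot(b)

    cProfile.run('run_optimization()', 'profile.out')
    pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)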
My current version of NumPy seems to use BLAS, since numpy.core._dotblas.dot appears as the function taking that 80% of the total processing time.
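A quick way to check which BLAS NumPy was built against (the output format varies between NumPy versions):

    import numpy as np

    # Prints the BLAS/LAPACK libraries NumPy was linked against
    # (e.g. ATLAS, OpenBLAS or MKL).
    np.show_config()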
As this is the core of my neural network and I have to run it a lot, any minor speed gain could save me a lot of time over the numerous repeated parameter optimizations.
More details: the matrix multiplication involves matrices ranging from 100*100 up to 500*500. I have a 12-core computer and use the cores so far to run different neural network parameter optimizations in parallel, but maybe the matrix multiplication itself could be done in parallel?
Thank you for your time!
Answer:
I spent a few days testing, installing and uninstalling libraries... Here is the result of what I tested. By default, on my version of Ubuntu (12.04) with the repository-installed version of NumPy, the BLAS library is ATLAS. I made some tests that reflect the improvement SPECIFICALLY on the computations I am interested in, so these results must not be interpreted as a final answer. The computations involve a matrix multiplication (dot product) in a 55000-iteration loop, with 500*500 and 1000*1000 matrices. I use an HP Z800 workstation with a Xeon X5675 @ 3.07 GHz and 12 cores. All results (percentages) are comparisons between the described condition and the reference, which here is the packaged ATLAS library.
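A minimal sketch of the kind of timing loop behind these numbers (sizes and iteration count are reduced here so it finishes quickly; the real benchmark used 55000 iterations):

    import time

    import numpy as np

    n = 500          # also tested with n = 1000
    iters = 1000     # the full benchmark used 55000 iterations
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    start = time.time()
    for _ in range(iters):
        c = a.dot(b)
    print("%dx%d, %d iterations: %.2f s" % (n, n, iters, time.time() - start))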
Scipy.sparse module: I don't know if I set it up correctly, but with 10% sparsity, using this module only becomes useful starting from 1500*1500 matrices with OpenBLAS and MKL. If you have suggestions about how to use it properly, I am interested! A minimal sketch of what I tried is shown below.
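For reference, here is roughly what I tried, assuming scipy.sparse.rand with density=0.1 matches what I mean by 10% sparseness:

    import scipy.sparse

    n = 1500
    # Random sparse matrices with 10% non-zero entries, in CSR format.
    a = scipy.sparse.rand(n, n, density=0.1, format='csr')
    b = scipy.sparse.rand(n, n, density=0.1, format='csr')

    c = a.dot(b)         # sparse-sparse product, result stays sparse
    dense = c.toarray()  # convert back to a dense ndarray if needed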
If you're not already, you could try linking numpy to a very optimized BLAS library like Intel MKL (which is free-as-in-beer for non-commercial use or discounted for academic use, which apparently doesn't count as non-commercial; instructions from Intel for using it with numpy) or OpenBLAS (free-as-in-speech). There's also the Enthought Python Distribution, which is pre-linked to MKL and free-as-in-beer for academics. That can parallelize your matrix multiplications automatically and can be much faster than the typical reference BLAS / ATLAS installation on most Linux distros, or whatever it is you're using.
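If you do end up with a threaded BLAS, the number of threads it uses can usually be controlled through environment variables set before NumPy is imported (which variable is honored depends on the BLAS implementation):

    import os

    # Must be set before numpy is imported to take effect.
    os.environ['OMP_NUM_THREADS'] = '12'       # OpenMP-based builds
    os.environ['OPENBLAS_NUM_THREADS'] = '12'  # OpenBLAS
    os.environ['MKL_NUM_THREADS'] = '12'       # Intel MKL

    import numpy as np  # import only after the variables are set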
Otherwise, the only thing I know of that you could do would be some mathematical tricks to not have to compute as many multiplications / solves. Without knowing exactly what you're doing it's hard to give any suggestions there.
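One common example of such a trick, in case it applies to your problem: if you call numpy.linalg.solve repeatedly with the same matrix but different right-hand sides, you can factor the matrix once and reuse the factorization. A sketch using scipy.linalg.lu_factor and lu_solve:

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    a = np.random.rand(500, 500)

    # Factor once (O(n^3)); each subsequent solve is only O(n^2).
    lu, piv = lu_factor(a)
    for _ in range(100):
        b = np.random.rand(500)     # a new right-hand side each time
        x = lu_solve((lu, piv), b)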
I'm assuming that your matrices are dense, since they usually are in neural nets, but if you're doing something unusual, scipy.sparse might help too.
Numpy uses really fast internal algorithms and representations based on third-party libraries (such as BLAS, as you named it) that already use SSE optimizations, among others. Because the reference BLAS is a tad slow (it aims to be a reference implementation, focusing on precision rather than performance), you may wish to use another flavor focused on performance, such as OpenBLAS. To use OpenBLAS, you need to either find a pre-built OpenBLAS-enabled NumPy package or recompile a version linked against OpenBLAS. Once you are using an efficient BLAS implementation, you won't find a better speedup option in pure Python, unless you write a library in C and spend much time optimizing it.
On the other hand, you can check whether your NumPy and BLAS library are compiled as efficiently as possible for your architecture. For instance, if you can activate the OpenMP library when compiling NumPy, it would allow multiple cores to work on your problem using data-level parallelism. This can be a significant source of speedup if you have multiple cores and your computations are CPU-bound. If your kind of problem allows it, you could even use a task-based parallel programming library (SCOOP [Disclaimer: I wrote it], Celery, etc.) to distribute your work across multiple computers.
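For task-level parallelism, even the standard library is enough to fan independent parameter optimizations out over your 12 cores; a minimal sketch with multiprocessing, where evaluate_params is a hypothetical stand-in for one optimization run:

    from multiprocessing import Pool

    def evaluate_params(params):
        # Hypothetical stand-in: train and score one network configuration.
        learning_rate, hidden_units = params
        return learning_rate * hidden_units

    if __name__ == '__main__':
        param_sets = [(0.1, 100), (0.2, 200), (0.3, 300)]
        with Pool(processes=12) as pool:
            results = pool.map(evaluate_params, param_sets)
        print(results)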
As a last resort, another possibility is to buy new hardware: it can make the software go faster without changing a single line of code.