While getting acquainted with CUDA in Python (the numba library), I implemented several matrix multiplication methods, including a plain numpy.dot() baseline alongside my own GPU kernels.
So I tested them on two types of data:
numpy.random.randint(0, 5, (N, N)) # with int32 elements
numpy.random.random((N, N)) # with float64 elements
For int32 I obtained the expected result: my GPU algorithms performed better than the CPU with numpy. However, with float64, numpy.dot() outperformed all of my GPU methods.
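For reference, here is a stripped-down sketch of the kind of comparison I mean. The kernel below is only a naive one-thread-per-element illustration, not the exact code I benchmarked, and the timings include the implicit host-to-device copies:

```python
import time
import numpy as np
from numba import cuda

@cuda.jit
def matmul_kernel(A, B, C):
    # Naive kernel: each thread computes one element of C.
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp

def gpu_matmul(A, B):
    # One thread per output element; numba copies the host arrays to the
    # device and the result back automatically.
    C = np.zeros((A.shape[0], B.shape[1]), dtype=A.dtype)
    threads = (16, 16)
    blocks = ((C.shape[0] + threads[0] - 1) // threads[0],
              (C.shape[1] + threads[1] - 1) // threads[1])
    matmul_kernel[blocks, threads](A, B, C)
    return C

N = 1024
for A in (np.random.randint(0, 5, (N, N)).astype(np.int32),  # int32 case
          np.random.random((N, N))):                          # float64 case
    B = A.copy()
    gpu_matmul(A, B)  # warm-up call, so JIT compilation is not timed
    t0 = time.time(); gpu_matmul(A, B); t_gpu = time.time() - t0
    t0 = time.time(); np.dot(A, B);     t_cpu = time.time() - t0
    print('%-8s GPU %.3fs   numpy.dot %.3fs' % (A.dtype, t_gpu, t_cpu))
```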
So, my question is: why is numpy.dot() so fast with float64 arrays, and does numpy use the GPU?
Background on dot(): numpy provides a function to compute the dot product of two arrays. If both arrays a and b are 1-dimensional, numpy.dot(a, b, out=None) returns their inner product (without complex conjugation). If both inputs are 2-dimensional, it treats them as matrices and performs matrix multiplication. For higher-dimensional inputs, it is a sum-product over the last axis of a and the second-to-last axis of b: dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m]).
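For example, the 1-D and 2-D cases:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(np.dot(a, b))        # 1-D inputs: inner product -> 32.0

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])
print(np.dot(A, B))        # 2-D inputs: matrix multiplication
# [[19. 22.]
#  [43. 50.]]
```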
A typical installation of numpy will be dynamically linked against a BLAS library, which provides routines for matrix-matrix and matrix-vector multiplication. For example, when you use np.dot() on a pair of float64 arrays, numpy will call the BLAS dgemm routine in the background. Although these library functions run on the CPU rather than the GPU, they are often multithreaded, and are very finely tuned for performance. A good BLAS implementation, such as MKL or OpenBLAS, will probably be hard to beat in terms of performance, even on the GPU*.
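You can check which BLAS your numpy build is linked against with np.show_config(); the exact output depends on your installation and numpy version:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries this numpy build was compiled against
# (for example MKL or OpenBLAS).
np.show_config()
```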
However, BLAS only supports floating point types. If you call np.dot() on integer arrays, numpy will fall back on using a very simple internal C++ implementation, which is single-threaded and much slower than a BLAS dot on two floating point arrays.
Without knowing more about how you conducted those benchmarks, I would bet that a plain call to numpy.dot would also comfortably beat your other 3 methods for float32, complex64 and complex128 arrays, which are the other 3 types supported by BLAS.
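If you want to see the BLAS-versus-fallback effect directly, a rough timing sweep over dtypes looks like this (absolute numbers will depend entirely on your machine and BLAS):

```python
import time
import numpy as np

N = 1000
for dtype in (np.float64, np.float32, np.complex64, np.complex128, np.int32):
    A = (np.random.random((N, N)) * 5).astype(dtype)
    B = (np.random.random((N, N)) * 5).astype(dtype)
    t0 = time.time()
    np.dot(A, B)
    print('%-10s %.3f s' % (np.dtype(dtype).name, time.time() - t0))
# Expect the four BLAS-supported dtypes to be fast (and multithreaded),
# and int32 to be dramatically slower (single-threaded fallback).
```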
* One possible way to beat standard BLAS would be to use cuBLAS, which is a BLAS implementation that will run on an NVIDIA GPU. The scikit-cuda library seems to provide Python bindings for it, although I've never used it myself.
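For what it's worth, based on the scikit-cuda documentation, usage would look roughly like the following; treat it as an unverified sketch, since as I said I haven't used the library myself:

```python
# Unverified sketch: multiply two float64 matrices with cuBLAS via
# scikit-cuda (requires pycuda and scikit-cuda, plus an NVIDIA GPU).
import numpy as np
import pycuda.autoinit             # initialises a CUDA context
import pycuda.gpuarray as gpuarray
import skcuda.linalg as linalg

linalg.init()

a = np.random.random((2000, 2000))
b = np.random.random((2000, 2000))

a_gpu = gpuarray.to_gpu(a)         # copy host arrays to the GPU
b_gpu = gpuarray.to_gpu(b)
c_gpu = linalg.dot(a_gpu, b_gpu)   # cuBLAS dgemm under the hood

c = c_gpu.get()                    # copy the result back to the host
print(np.allclose(c, a.dot(b)))
```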