I am wondering how much GPU computing would help me speed up my simulations.
The critical part of my code is matrix multiplication. Basically, the code looks like the following Python code, with matrices of order 1000 and long for-loops.
import numpy as np
m_size = 1000
sim_length = 50
a = np.random.rand(m_size, m_size)
b = np.random.rand(m_size, m_size)
for j in range(sim_length):
    result = np.dot(a, b)  # dense 1000x1000 matrix product, repeated each iteration
Note: My matrices are dense and mostly random, and the for-loops are compiled with Cython.
My naive guess would be that I have two factors:
I expect that this viewpoint is too naive, so what am I missing?
In your case of matrix multiplication, you can parallelize the computations. A GPU provides far more hardware threads than a CPU, and those threads are grouped into blocks that execute concurrently, so many partial products are computed in parallel and the overall computation finishes quickly.
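As an illustration, here is a minimal sketch of a naive GPU matrix-multiplication kernel written with Numba's CUDA support (an assumption on my part: you have a CUDA-capable GPU and the numba package installed; Numba and the 16x16 block size are illustrative choices, not something from the question):

import numpy as np
from numba import cuda

@cuda.jit
def matmul_kernel(a, b, out):
    # Each GPU thread computes exactly one element of the result
    i, j = cuda.grid(2)
    if i < out.shape[0] and j < out.shape[1]:
        acc = 0.0
        for k in range(a.shape[1]):
            acc += a[i, k] * b[k, j]
        out[i, j] = acc

n = 1000
d_a = cuda.to_device(np.random.rand(n, n))    # copy inputs to device memory
d_b = cuda.to_device(np.random.rand(n, n))
d_out = cuda.device_array((n, n))             # result buffer on the device
threads_per_block = (16, 16)                  # threads are grouped into blocks
blocks_per_grid = ((n + 15) // 16, (n + 15) // 16)
matmul_kernel[blocks_per_grid, threads_per_block](d_a, d_b, d_out)
result = d_out.copy_to_host()                 # copy the result back to host RAM

The block and grid shapes simply tile the 1000x1000 output so that every output element gets its own thread.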
The results presented in this paper show that the GPU implementation using shared memory is twice as fast as the implementation that uses only the device's global memory, and up to 7.5 times faster than the CPU implementation.
Because GPUs have more cores and can perform parallel operations on multiple sets of data, their throughput meets or exceeds what is commonly needed for non-graphical tasks such as machine learning and scientific computation.
Because basic numerical linear algebra operations play a crucial role in real-time 3D computer graphics, GPUs are designed around this set of operations. Since GPUs offer higher peak performance and memory bandwidth, numerical linear algebra applications can deliver much higher performance on them than on multi-core CPUs alone.
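To put a number on your current setup, here is a minimal sketch that times a single dense product on the CPU (plain numpy plus the standard library, nothing assumed beyond the question's own imports):

import time
import numpy as np

n = 1000
a = np.random.rand(n, n)
b = np.random.rand(n, n)
t0 = time.perf_counter()
np.dot(a, b)
elapsed = time.perf_counter() - t0
# a dense n x n product costs about 2*n^3 floating-point operations
print(f"CPU GEMM: {elapsed * 1e3:.1f} ms, {2 * n**3 / elapsed / 1e9:.1f} GFLOP/s")

Comparing this measured GFLOP/s figure against a GPU's sustained rate tells you how much headroom there actually is.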
If you use numpy, you are probably using one of the BLAS libraries as the computational backend, such as ATLAS, OpenBLAS, or MKL. If you are using the fastest one, MKL, you can find a recent performance benchmark here, comparing a recent Nvidia K40m GPU with a 12-core Intel Xeon E5-2697 v2 @ 2.70GHz:
https://developer.nvidia.com/cublas
where the K40m is 6x faster than the 12-thread E5-2697. Since MKL scales well on multi-core CPUs, the K40m is roughly 72x faster than a single-threaded E5-2697. Please also note that 1000 is almost the lower bound of matrix size needed to fully utilise both the GPU and the CPU; smaller matrices usually cause a larger performance drop on the GPU.
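To check which BLAS backend your own numpy build links against, numpy can print its build configuration:

import numpy as np
np.show_config()   # lists the BLAS/LAPACK libraries numpy was built with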
If you are using a slower BLAS backend for numpy, say ATLAS, you can find a comparison between MKL and ATLAS here:
https://software.intel.com/en-us/intel-mkl/benchmarks#DGEMM-ATLAS
where MKL is 2-4x faster than ATLAS.
For Nvidia GPUs, the only widely used BLAS backend is CUDA's cuBLAS, so the performance won't vary between backends the way it does for ATLAS vs. MKL.
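For example, the CuPy library exposes a numpy-like interface on top of cuBLAS, so the multiplication from the question carries over almost unchanged (a sketch, assuming cupy is installed and a CUDA GPU is available; CuPy is my suggestion, not something from the question):

import cupy as cp

a = cp.random.rand(1000, 1000)    # allocated directly in GPU memory
b = cp.random.rand(1000, 1000)
result = cp.matmul(a, b)          # dispatched to cuBLAS on the device
result_host = cp.asnumpy(result)  # explicit copy back to host RAM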
As @janbrohl says, data transfer between host RAM and GPU device memory is an important factor that affects overall performance. Here's a benchmark of data transfer speeds:
CUDA - how much slower is transferring over PCI-E?
Given the matrix size, you can actually calculate the absolute time spent on computation and on data transfer, respectively. These numbers help you evaluate the performance trade-off better.
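As a sketch of that calculation (the throughput and bandwidth figures below are assumptions for a K40m-class GPU on PCIe 3.0, not measured values):

n = 1000
flops = 2 * n**3                    # operations in one dense n x n product
bytes_moved = 3 * n * n * 8         # two float64 inputs plus one result
gpu_flops_per_s = 1.2e12            # assumed sustained DGEMM rate, ~1.2 TFLOP/s
pcie_bytes_per_s = 6e9              # assumed effective PCIe bandwidth, ~6 GB/s
print(f"compute:  {flops / gpu_flops_per_s * 1e3:.2f} ms")
print(f"transfer: {bytes_moved / pcie_bytes_per_s * 1e3:.2f} ms")

With these assumed figures, a single 1000x1000 product spends more time on the PCIe bus than in the compute units, which is exactly why keeping data resident on the GPU matters.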
To maximise performance on the GPU, you probably need to redesign your program to minimise data transfer, by moving all of the computational operations onto the GPU rather than only the matrix multiplication.
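For instance, with CuPy the whole simulation loop can stay on the device (a hypothetical reworking of the question's loop; `state` and the update step are placeholders for your real simulation):

import cupy as cp

a = cp.random.rand(1000, 1000)     # created directly in GPU memory
state = cp.random.rand(1000, 1000)
for j in range(50):
    state = cp.matmul(a, state)    # every iteration stays on the GPU
result = cp.asnumpy(state)         # a single transfer back at the very end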