I am wondering how much GPU computing would help me speed up my simulations.
The critical part of my code is matrix multiplication. Basically, the code looks like the following Python code, with matrices of order 1000 and long for-loops.
import numpy as np
m_size = 1000
sim_length = 50
a = np.random.rand(m_size, m_size)
b = np.random.rand(m_size, m_size)
for j in range(sim_length):
    result = np.dot(a, b)  # dense 1000x1000 matrix product, repeated each iteration
Note: My matrices are dense and mostly random, and the for-loops are compiled with Cython.
My naive guess would be that I have two factors:
I expect that this viewpoint is too naive, so what am I missing?
In your case of matrix multiplication, you can parallelize the computations. A GPU provides far more hardware threads than a CPU, and those threads are grouped into blocks that execute concurrently, so many partial products are computed in parallel and the overall computation finishes quickly.
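As an illustration, here is a minimal sketch of a naive GPU matrix-multiplication kernel written with Numba's CUDA support (an assumption on my part: you have a CUDA-capable GPU and the numba package installed; Numba and the 16x16 block size are illustrative choices, not something from the question):

import numpy as np
from numba import cuda

@cuda.jit
def matmul_kernel(a, b, out):
    # Each GPU thread computes exactly one element of the result
    i, j = cuda.grid(2)
    if i < out.shape[0] and j < out.shape[1]:
        acc = 0.0
        for k in range(a.shape[1]):
            acc += a[i, k] * b[k, j]
        out[i, j] = acc

n = 1000
d_a = cuda.to_device(np.random.rand(n, n))    # copy inputs to device memory
d_b = cuda.to_device(np.random.rand(n, n))
d_out = cuda.device_array((n, n))             # result buffer on the device
threads_per_block = (16, 16)                  # threads are grouped into blocks
blocks_per_grid = ((n + 15) // 16, (n + 15) // 16)
matmul_kernel[blocks_per_grid, threads_per_block](d_a, d_b, d_out)
result = d_out.copy_to_host()                 # copy the result back to host RAM

The block and grid shapes simply tile the 1000x1000 output so that every output element gets its own thread.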
The results presented in this paper show that the GPU implementation using shared memory is twice as fast as the implementation that uses only the device's global memory, and up to 7.5 times faster than the CPU implementation.
Because GPUs have more cores and can perform parallel operations on multiple sets of data, their throughput meets or exceeds what is commonly needed for non-graphical tasks such as machine learning and scientific computation.
Because basic numerical linear algebra operations play a crucial role in real-time 3D computer graphics, GPUs are designed around this set of operations. Since GPUs offer higher peak performance and memory bandwidth, numerical linear algebra applications can deliver much higher performance on them than on multi-core CPUs alone.
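To put a number on your current setup, here is a minimal sketch that times a single dense product on the CPU (plain numpy plus the standard library, nothing assumed beyond the question's own imports):

import time
import numpy as np

n = 1000
a = np.random.rand(n, n)
b = np.random.rand(n, n)
t0 = time.perf_counter()
np.dot(a, b)
elapsed = time.perf_counter() - t0
# a dense n x n product costs about 2*n^3 floating-point operations
print(f"CPU GEMM: {elapsed * 1e3:.1f} ms, {2 * n**3 / elapsed / 1e9:.1f} GFLOP/s")

Comparing this measured GFLOP/s figure against a GPU's sustained rate tells you how much headroom there actually is.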
If you use numpy, you are probably using one of the BLAS libraries as the computational backend, such as ATLAS, OpenBLAS, or MKL. If you are using the fastest one, MKL, you can find a recent performance benchmark here, comparing a recent Nvidia K40m GPU with a 12-core Intel Xeon E5-2697 v2 @ 2.70GHz:
https://developer.nvidia.com/cublas
where the K40m is 6x faster than the 12-thread E5-2697. Since MKL scales well on multi-core CPUs, the K40m is roughly 72x faster than a single-threaded E5-2697. Please also note that 1000 is almost the lower bound of matrix size needed to fully utilise both the GPU and the CPU; smaller matrices usually cause a larger performance drop on the GPU.
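To check which BLAS backend your own numpy build links against, numpy can print its build configuration:

import numpy as np
np.show_config()   # lists the BLAS/LAPACK libraries numpy was built with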
If you are using a slower BLAS backend for numpy, say ATLAS, you can find a comparison between MKL and ATLAS here:
https://software.intel.com/en-us/intel-mkl/benchmarks#DGEMM-ATLAS
where MKL is 2-4x faster than ATLAS.
For Nvidia GPUs, the only widely used BLAS backend is CUDA's cuBLAS, so the performance won't vary between backends the way it does for ATLAS vs. MKL.
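For example, the CuPy library exposes a numpy-like interface on top of cuBLAS, so the multiplication from the question carries over almost unchanged (a sketch, assuming cupy is installed and a CUDA GPU is available; CuPy is my suggestion, not something from the question):

import cupy as cp

a = cp.random.rand(1000, 1000)    # allocated directly in GPU memory
b = cp.random.rand(1000, 1000)
result = cp.matmul(a, b)          # dispatched to cuBLAS on the device
result_host = cp.asnumpy(result)  # explicit copy back to host RAM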
As @janbrohl says, data transfer between host RAM and GPU device memory is an important factor that affects overall performance. Here's a benchmark of data transfer speeds:
CUDA - how much slower is transferring over PCI-E?
Given the matrix size, you can actually calculate the absolute time spent on computation and on data transfer, respectively. These numbers help you evaluate the performance trade-off better.
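As a sketch of that calculation (the throughput and bandwidth figures below are assumptions for a K40m-class GPU on PCIe 3.0, not measured values):

n = 1000
flops = 2 * n**3                    # operations in one dense n x n product
bytes_moved = 3 * n * n * 8         # two float64 inputs plus one result
gpu_flops_per_s = 1.2e12            # assumed sustained DGEMM rate, ~1.2 TFLOP/s
pcie_bytes_per_s = 6e9              # assumed effective PCIe bandwidth, ~6 GB/s
print(f"compute:  {flops / gpu_flops_per_s * 1e3:.2f} ms")
print(f"transfer: {bytes_moved / pcie_bytes_per_s * 1e3:.2f} ms")

With these assumed figures, a single 1000x1000 product spends more time on the PCIe bus than in the compute units, which is exactly why keeping data resident on the GPU matters.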
To maximise performance on the GPU, you probably need to redesign your program to minimise data transfer, by moving all of the computational operations onto the GPU rather than only the matrix multiplication.
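For instance, with CuPy the whole simulation loop can stay on the device (a hypothetical reworking of the question's loop; `state` and the update step are placeholders for your real simulation):

import cupy as cp

a = cp.random.rand(1000, 1000)     # created directly in GPU memory
state = cp.random.rand(1000, 1000)
for j in range(50):
    state = cp.matmul(a, state)    # every iteration stays on the GPU
result = cp.asnumpy(state)         # a single transfer back at the very end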