
matrix multiplication in cuda

Tags:

cuda

Say I want to multiply two matrices together, each 50 by 50. I have two ways to arrange threads and blocks.

a) One thread calculates each element of the result matrix, so each thread has a loop that multiplies one row by one column.
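
In kernel form, (a) would be something like this (a rough sketch; the matrix names and the hard-coded size are just placeholders):

    // Sketch of (a): one thread per output element of C = A * B.
    // A, B, C are row-major N x N float matrices.
    #define N 50

    __global__ void matmul_per_element(const float *A, const float *B, float *C)
    {
        // e.g. launched with 2D blocks covering the 50 x 50 output
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)   // one row of A times one column of B
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }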

b) One thread does each multiplication, so each element of the result matrix requires 50 threads. After the multiplications are done, I can use a binary reduction to sum the results.
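
And (b) would be roughly this (again a sketch; one block per element of the result, with 50 padded up to 64 threads so the power-of-two reduction works):

    // Sketch of (b): one thread per scalar multiplication, one block per
    // element of C, then a shared-memory binary reduction over the products.
    #define N 50
    #define THREADS 64   // next power of two >= N

    __global__ void matmul_per_product(const float *A, const float *B, float *C)
    {
        __shared__ float partial[THREADS];
        int row = blockIdx.y, col = blockIdx.x, k = threadIdx.x;

        // each thread computes exactly one product (padding threads contribute 0)
        partial[k] = (k < N) ? A[row * N + k] * B[k * N + col] : 0.0f;
        __syncthreads();

        // binary reduction: halve the number of active threads each step
        for (int stride = THREADS / 2; stride > 0; stride >>= 1) {
            if (k < stride)
                partial[k] += partial[k + stride];
            __syncthreads();
        }
        if (k == 0)
            C[row * N + col] = partial[0];   // launched as <<<dim3(N, N), THREADS>>>
    }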

I wasn't sure which way to take, so I took (b). It wasn't ideal; in fact it was slow. Any idea why? My guess is that there are just too many threads and they are waiting for resources most of the time. Is that true?

asked Oct 05 '10 by small_potato



3 Answers

As with so many things in high-performance computing, the key to understanding performance here is understanding the use of memory.

If you are using one thread to do one multiplication, then each of those threads has to pull two pieces of data from memory, multiply them, and then take part in some logarithmic number of adds. That's three memory accesses for a multiply and an add and a bit - the arithmetic intensity is very low. The good news is that there are many, many threads' worth of tasks this way, each of which needs only a tiny amount of memory/registers, which is good for occupancy; but the memory-access-to-work ratio is poor.

The simple one-thread-per-dot-product approach has the same sort of problem - each multiplication requires two memory accesses to load its operands. The good news is that there's only one store to global memory for the whole dot product, and you avoid the binary reduction, which doesn't scale as well and requires a lot of synchronization; the downside is that there are now far fewer threads, which at least your (b) approach had working for you.

Now you know that there should be some way of doing more operations per memory access than this; for square NxN matrices, there's N^3 work to do in the multiplication, but only 3xN^2 elements to move - so you should be able to find a way to do far more than one computation per two-ish memory accesses.
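
To put numbers on that, counting a multiply-add as two flops:

    \text{arithmetic intensity} = \frac{2N^3~\text{flops}}{3N^2~\text{elements}}
                                = \frac{2N}{3} \approx 33~\text{flops per element for } N = 50

So in principle every element you load can be reused dozens of times.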

The approach taken in the CUDA SDK is the best way - the matrices are broken into tiles, and your (a) approach - one thread per output element - is used. But the key is in how the threads are arranged. By pulling entire small sub-matrices from slow global memory into shared memory, and doing the calculations from there, it's possible to do many multiplications and adds on each number you've read in from memory. This approach is the most successful one in lots of applications, because getting the data - whether it's over a network, or from main memory for a CPU, or from off-chip memory for a GPU - often takes much longer than processing it.
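
The core of that tiled kernel looks roughly like this (a simplified sketch in the spirit of the SDK's matrixMul sample, assuming the matrix size n is a multiple of the tile width; the real sample handles the general case):

    #define TILE 16

    __global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
    {
        __shared__ float As[TILE][TILE];   // staging area for one tile of A
        __shared__ float Bs[TILE][TILE];   // staging area for one tile of B

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < n / TILE; ++t) {
            // the whole block cooperatively loads two tiles from global memory
            As[threadIdx.y][threadIdx.x] = A[row * n + (t * TILE + threadIdx.x)];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();

            // each value just loaded is reused TILE times from fast shared memory
            for (int k = 0; k < TILE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * n + col] = sum;   // one global store per output element
    }

Each thread still produces one output element, but every value staged into shared memory is reused TILE times - exactly the extra arithmetic per memory access that the counting argument above says is available.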

There are documents on NVIDIA's CUDA pages (esp. http://developer.nvidia.com/object/cuda_training.html ) which describe their SDK example very nicely.

answered Sep 24 '22 by Jonathan Dursi


Have you looked at the CUDA documentation: CUDA Programming Model

Also, sample source code: Matrix Multiplication

answered Sep 21 '22 by Mitch Wheat


Did you look at

$SDK/nvidia-gpu-sdk-3.1/C/src/matrixMul

i.e. the matrix multiplication example in the SDK?

answered Sep 23 '22 by Dirk Eddelbuettel