
matrix multiplication in cuda

Tags:

cuda

Say I want to multiply two matrices together, each 50 by 50. I have two ways to arrange threads and blocks.

a) One thread calculates each element of the result matrix, so each thread has a loop that multiplies one row by one column.
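
In kernel form, (a) would be something like this (a rough sketch; the matrix names and the hard-coded size are just placeholders):

    // Sketch of (a): one thread per output element of C = A * B.
    // A, B, C are row-major N x N float matrices.
    #define N 50

    __global__ void matmul_per_element(const float *A, const float *B, float *C)
    {
        // e.g. launched with 2D blocks covering the 50 x 50 output
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)   // one row of A times one column of B
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }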

b) One thread does each multiplication, so each element of the result matrix requires 50 threads. After the multiplications are done, I can use a binary reduction to sum the results.
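
And (b) would be roughly this (again a sketch; one block per element of the result, with 50 padded up to 64 threads so the power-of-two reduction works):

    // Sketch of (b): one thread per scalar multiplication, one block per
    // element of C, then a shared-memory binary reduction over the products.
    #define N 50
    #define THREADS 64   // next power of two >= N

    __global__ void matmul_per_product(const float *A, const float *B, float *C)
    {
        __shared__ float partial[THREADS];
        int row = blockIdx.y, col = blockIdx.x, k = threadIdx.x;

        // each thread computes exactly one product (padding threads contribute 0)
        partial[k] = (k < N) ? A[row * N + k] * B[k * N + col] : 0.0f;
        __syncthreads();

        // binary reduction: halve the number of active threads each step
        for (int stride = THREADS / 2; stride > 0; stride >>= 1) {
            if (k < stride)
                partial[k] += partial[k + stride];
            __syncthreads();
        }
        if (k == 0)
            C[row * N + col] = partial[0];   // launched as <<<dim3(N, N), THREADS>>>
    }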

I wasn't sure which way to take, so I took (b). It wasn't ideal; in fact it was slow. Any idea why? My guess is that there are just too many threads and they are waiting for resources most of the time. Is that true?

asked Oct 05 '10 by small_potato



3 Answers

As with so many things in high-performance computing, the key to understanding performance here is understanding the use of memory.

If you are using one thread to do one multiplication, then each of those threads has to pull two pieces of data from memory, multiply them, and then take part in some logarithmic number of adds. That's three memory accesses for a multiply and an add and a bit - the arithmetic intensity is very low. The good news is that there are many, many threads' worth of tasks this way, each of which needs only a tiny amount of memory/registers, which is good for occupancy; but the memory-access-to-work ratio is poor.

The simple one-thread-per-dot-product approach has the same sort of problem - each multiplication requires two memory accesses to load its operands. The good news is that there's only one store to global memory for the whole dot product, and you avoid the binary reduction, which doesn't scale as well and requires a lot of synchronization; the downside is that there are now far fewer threads, which at least your (b) approach had working for you.

Now you know that there should be some way of doing more operations per memory access than this; for square NxN matrices, there's N^3 work to do in the multiplication, but only 3xN^2 elements to move - so you should be able to find a way to do far more than one computation per two-ish memory accesses.
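
To put numbers on that, counting a multiply-add as two flops:

    \text{arithmetic intensity} = \frac{2N^3~\text{flops}}{3N^2~\text{elements}}
                                = \frac{2N}{3} \approx 33~\text{flops per element for } N = 50

So in principle every element you load can be reused dozens of times.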

The approach taken in the CUDA SDK is the best way - the matrices are broken into tiles, and your (a) approach - one thread per output element - is used. But the key is in how the threads are arranged. By pulling entire small sub-matrices from slow global memory into shared memory, and doing the calculations from there, it's possible to do many multiplications and adds on each number you've read in from memory. This approach is the most successful one in lots of applications, because getting the data - whether it's over a network, or from main memory for a CPU, or from off-chip memory for a GPU - often takes much longer than processing it.
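
The core of that tiled kernel looks roughly like this (a simplified sketch in the spirit of the SDK's matrixMul sample, assuming the matrix size n is a multiple of the tile width; the real sample handles the general case):

    #define TILE 16

    __global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
    {
        __shared__ float As[TILE][TILE];   // staging area for one tile of A
        __shared__ float Bs[TILE][TILE];   // staging area for one tile of B

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < n / TILE; ++t) {
            // the whole block cooperatively loads two tiles from global memory
            As[threadIdx.y][threadIdx.x] = A[row * n + (t * TILE + threadIdx.x)];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();

            // each value just loaded is reused TILE times from fast shared memory
            for (int k = 0; k < TILE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * n + col] = sum;   // one global store per output element
    }

Each thread still produces one output element, but every value staged into shared memory is reused TILE times - exactly the extra arithmetic per memory access that the counting argument above says is available.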

There are documents on NVIDIA's CUDA pages (esp. http://developer.nvidia.com/object/cuda_training.html ) which describe their SDK example very nicely.

answered Sep 24 '22 by Jonathan Dursi


Have you looked at the CUDA documentation: CUDA Programming Model

Also, sample source code: Matrix Multiplication

answered Sep 21 '22 by Mitch Wheat


Did you look at

$SDK/nvidia-gpu-sdk-3.1/C/src/matrixMul

i.e. the matrix multiplication example in the SDK?

answered Sep 23 '22 by Dirk Eddelbuettel