CUDA: how to get grid, block, and thread sizes and parallelize a non-square matrix calculation


I am new to CUDA and need help understanding some things. I need help parallelizing these two for loops, specifically how to set up dimBlock and dimGrid to make this run faster. I know this looks like the vector add example in the SDK, but that example is only for square matrices, and when I try to modify that code for my 128 x 1024 matrix it doesn't work properly.

__global__ void mAdd(float* A, float* B, float* C)
{
    for(int i = 0; i < 128; i++)
    {
        for(int j = 0; j < 1024; j++)
        {
            C[i * 1024 + j] = A[i * 1024 + j] + B[i * 1024 + j];
        }
    }
}

This code is part of a larger loop and is the simplest portion of the code, so I decided to try to parallelize this and learn CUDA at the same time. I have read the guides but still do not understand how to choose the proper number of grids/blocks/threads and use them effectively.


asked Apr 13 '11 00:04

user656210

1 Answer

As you have written it, that kernel is completely serial. Every thread launched to execute it is going to perform the same work.

The main idea behind CUDA (and OpenCL and other similar "single program, multiple data" type programming models) is that you take a "data parallel" operation - so one where the same, largely independent, operation must be performed many times - and write a kernel which performs that operation. A large number of (semi)autonomous threads are then launched to perform that operation across the input data set.

In your array addition example, the data parallel operation is

C[k] = A[k] + B[k];

for all k from 0 up to (but not including) 128 * 1024. Each addition operation is completely independent and has no ordering requirements, and therefore can be performed by a different thread. To express this in CUDA, one might write the kernel like this:

__global__ void mAdd(float* A, float* B, float* C, int n)
{
    int k = threadIdx.x + blockIdx.x * blockDim.x;

    if (k < n)
        C[k] = A[k] + B[k];
}

[disclaimer: code written in browser, not tested, use at own risk]

Here, the inner and outer loop from the serial code are replaced by one CUDA thread per operation, and I have added a limit check in the code so that in cases where more threads are launched than required operations, no buffer overflow can occur. If the kernel is then launched like this:

const int n = 128 * 1024;
int blocksize = 512; // value usually chosen by tuning and hardware constraints
int nblocks = n / blocksize; // value determined by block size and total work

mAdd<<<nblocks, blocksize>>>(A, B, C, n);

Then 256 blocks, each containing 512 threads, will be launched onto the GPU hardware to perform the array addition operation in parallel. Note that if the input data size were not an exact multiple of the block size, the number of blocks would need to be rounded up to cover the full input data set.
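As a sketch of that rounding-up step (written in the same untested spirit as the kernel above), the block count can be computed with a ceiling division and the launch wrapped in a host function. The helper names `blocksFor` and `addOnGpu` are hypothetical, the block size of 512 is just the value used above, and error handling is kept to a minimum:

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Same kernel as above: one thread per element, guarded by a bounds check.
__global__ void mAdd(const float* A, const float* B, float* C, int n)
{
    int k = threadIdx.x + blockIdx.x * blockDim.x;
    if (k < n)
        C[k] = A[k] + B[k];
}

// Hypothetical helper: number of blocks needed to cover n elements,
// rounded up so that a partially filled final block is still launched.
inline int blocksFor(int n, int blocksize)
{
    assert(n >= 0 && blocksize > 0);
    return (n + blocksize - 1) / blocksize;
}

// Hypothetical host wrapper: copies the inputs to the device, launches the
// kernel, and copies the result back. Returns the first CUDA error seen.
inline cudaError_t addOnGpu(const float* hA, const float* hB, float* hC, int n)
{
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    cudaError_t err = cudaMalloc(&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dC, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int blocksize = 512; // assumed value; tune for your hardware
        mAdd<<<blocksFor(n, blocksize), blocksize>>>(dA, dB, dC, n);
        err = cudaGetLastError();  // reports launch configuration errors
    }
    // This copy runs after the kernel has finished on the default stream.
    if (err == cudaSuccess) err = cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    // Freeing a null pointer is a no-op, so this is safe after a failure.
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

With n = 128 * 1024 and a block size of 512 the helper returns the same 256 blocks as before; with n = 1000 it returns 2, and the `if (k < n)` check in the kernel keeps the 24 surplus threads from touching memory.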

All of the above is a hugely simplified overview of the CUDA paradigm for a very trivial operation, but perhaps it gives enough insight for you to continue yourself. CUDA is rather mature these days and there is a lot of good, free educational material floating around the web you can probably use to further illuminate many of the aspects of the programming model I have glossed over in this answer.
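For the 128 x 1024 matrix in the question, the same idea can also be expressed with a two-dimensional launch, which is where `dimBlock` and `dimGrid` come in. The following is a minimal sketch, not tested on hardware. The names `mAdd2D`, `gridFor`, and `launchAdd2D` and the 32 x 8 block shape are assumptions chosen for illustration, and row-major storage is assumed, as in the original loops:

```cuda
#include <cassert>
#include <cuda_runtime.h>

// One thread per matrix element; x indexes columns, y indexes rows.
__global__ void mAdd2D(const float* A, const float* B, float* C,
                       int rows, int cols)
{
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;

    if (row < rows && col < cols) {
        int k = row * cols + col; // row-major, same layout as i * 1024 + j
        C[k] = A[k] + B[k];
    }
}

// Hypothetical helper: grid dimensions that cover a rows x cols matrix,
// rounding up in each direction.
inline dim3 gridFor(int rows, int cols, dim3 block)
{
    assert(rows >= 0 && cols >= 0 && block.x > 0 && block.y > 0);
    return dim3((cols + block.x - 1) / block.x,
                (rows + block.y - 1) / block.y);
}

// Hypothetical launch wrapper; dA, dB and dC must already be device pointers.
inline cudaError_t launchAdd2D(const float* dA, const float* dB, float* dC,
                               int rows, int cols)
{
    dim3 dimBlock(32, 8); // 256 threads per block (assumed shape)
    dim3 dimGrid = gridFor(rows, cols, dimBlock);
    mAdd2D<<<dimGrid, dimBlock>>>(dA, dB, dC, rows, cols);
    return cudaGetLastError();
}
```

With rows = 128 and cols = 1024 this gives a 32 x 16 grid of 256-thread blocks, so one thread per element. For a plain element-wise addition the one-dimensional kernel is just as valid; the 2D form mainly pays off when the row and column indices are needed in the computation itself.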


answered Sep 26 '22 03:09

talonmies
