 

Parallelism on the GPU - CUDA / OpenCL

Tags: cuda, opencl

I have a general question about parallelism in CUDA or OpenCL code on the GPU. I am using an NVIDIA GTX 470.

I skimmed the CUDA Programming Guide but did not find an answer to this, hence I am asking here.

I have a top-level function which calls the CUDA kernel (I also have an OpenCL version of the same kernel). This top-level function is itself called 3 times in a 'for' loop from my main function, once for each of 3 data sets (the image's R, G and B planes), and the actual codelet processes all the pixels in the image/frame, so it contains 2 'for' loops.
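In rough outline, the structure is like this (a sketch only: all names are placeholders, the per-pixel work is a dummy, and I've shown the common pattern where the codelet's two pixel loops become one thread per pixel):

    #include <cuda_runtime.h>

    __global__ void process_plane(unsigned char *data, int width, int height)
    {
        // One thread per pixel stands in for the codelet's two 'for' loops.
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            data[y * width + x] = 255 - data[y * width + x];  // dummy per-pixel work
    }

    // Top-level function: configures and launches the kernel for one plane.
    void process_top_level(unsigned char *d_plane, int width, int height)
    {
        dim3 threads(16, 16);
        dim3 blocks((width + 15) / 16, (height + 15) / 16);
        process_plane<<<blocks, threads>>>(d_plane, width, height);
    }

    int main()
    {
        const int width = 640, height = 480;
        unsigned char *d_planes[3];                    // R, G, B planes
        for (int p = 0; p < 3; ++p)
            cudaMalloc(&d_planes[p], width * height);

        for (int p = 0; p < 3; ++p)                    // the 3 calls from main
            process_top_level(d_planes[p], width, height);

        cudaDeviceSynchronize();
        for (int p = 0; p < 3; ++p)
            cudaFree(d_planes[p]);
        return 0;
    }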

What I want to know is what kind of parallelism is exploited here: task-level parallelism or data parallelism?

So what I want to understand is: does this CUDA/C code create multiple threads for the different functions in the codelet and the top-level code, and execute them in parallel, thereby exploiting task parallelism? If yes, who creates those threads, given that no threading library is explicitly included in or linked with the code?

OR

Does it create threads/tasks for the different 'for' loop iterations, which are independent, thus achieving data parallelism? If so, does it exploit this simply by noting that the different iterations have no dependencies and hence can be scheduled in parallel?

I ask because I don't see any special compiler constructs or intrinsics (such as the parallel for loops in OpenMP) telling the compiler/scheduler to schedule such loops or functions in parallel.

Any reading material would help.

asked Dec 28 '22 by goldenmean

2 Answers

Parallelism on GPUs is SIMT (Single Instruction, Multiple Threads). For CUDA kernels, you specify a grid of blocks where every block has N threads. The CUDA library does all the work behind the scenes, and the CUDA compiler (nvcc) generates the GPU code which is executed by the GPU. The CUDA library tells the GPU driver, and in turn the thread scheduler on the GPU, how many threads should execute the kernel ((number of blocks) x (number of threads)). In your example the top-level function (or host function) executes only the kernel call, which is asynchronous and returns immediately. No threading library is needed because nvcc generates the calls to the driver.

A sample kernel call looks like this:

helloworld<<<BLOCKS, THREADS>>>(/* maybe some parameters */);

OpenCL follows the same paradigm, but you compile your kernels at runtime (if they are not precompiled). You specify the number of threads that should execute the kernel, and the library does the rest.
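For illustration, a runnable version of that sample might look like this (a sketch: the block/thread counts are arbitrary, and device-side printf requires compute capability 2.0 or newer, which the GTX 470 from the question has):

    #include <cstdio>

    __global__ void helloworld()
    {
        // (number of blocks) x (number of threads) threads all run this body.
        printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
    }

    int main()
    {
        helloworld<<<2, 4>>>();   // asynchronous: returns immediately to the host
        cudaDeviceSynchronize();  // the host must wait here to see the output
        return 0;
    }

Note that the launch returns before the GPU has finished; that is exactly the asynchronous behaviour described above.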

The best way to learn CUDA (or OpenCL) is to read the CUDA Programming Guide (or the OpenCL Programming Guide) and look at the samples in the GPU Computing SDK.

answered Jan 11 '23 by Michael Haidl


What I want to know is what kind of parallelism is exploited here: task-level parallelism or data parallelism?

Predominantly data parallelism, but there's also some task parallelism involved.

In your image processing example a kernel might do the processing for a single output pixel. You'd instruct OpenCL or CUDA to run as many threads as there are pixels in the output image. It then schedules those threads to run on the GPU/CPU that you're targeting.

This is highly data parallel: the kernel is written to do a single work item, and you schedule millions of them.

The task parallelism comes in because your host program is still running on the CPU while the GPU is running all those threads, so it can get on with other work, as sketched below. Often this is preparing data for the next set of kernel threads, but it could be a completely separate task.
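A sketch of that overlap (hypothetical names; prepare_next_frame stands in for any useful CPU-side work):

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void process_frame(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;                             // dummy per-element work
    }

    // Stand-in for useful CPU work, e.g. preparing the next frame.
    void prepare_next_frame() { printf("CPU preparing next frame...\n"); }

    int main()
    {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        process_frame<<<(n + 255) / 256, 256>>>(d_data, n);  // asynchronous launch

        prepare_next_frame();     // runs on the CPU while the GPU is busy

        cudaDeviceSynchronize();  // wait for the GPU before using its results
        cudaFree(d_data);
        return 0;
    }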

answered Jan 11 '23 by Paul S