
Persistent threads in OpenCL and CUDA

I have read some papers talking about "persistent threads" for GPGPU, but I don't really understand it. Can anyone give me an example, or show me how this programming style is used?

What I keep in mind after reading and googling "persistent threads":

Persistent threads are nothing more than a while loop that keeps a thread running and processing many pieces of work.

Is this correct? Thanks in advance.

References:
http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1089
http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0157-GTC2012-Persistent-Threads-Computing.pdf

asked Feb 11 '13 by AmineMs




2 Answers

CUDA exploits the Single Instruction Multiple Data (SIMD) programming model. The computational threads are organized in blocks, and each thread block is assigned to a Streaming Multiprocessor (SM). The execution of a thread block on an SM is performed by arranging the threads in warps of 32 threads: each warp operates in lock-step, executing exactly the same instruction on different data.

Generally, to fill up the GPU, the kernel is launched with many more blocks than can actually be hosted on the SMs. Since not all the blocks can be hosted on the SMs at once, a work scheduler performs a context switch when a block has finished computing. It should be noted that the switching of blocks is managed entirely in hardware by the scheduler, and the programmer has no means of influencing how blocks are scheduled onto the SMs. This exposes a limitation for all those algorithms that do not perfectly fit a SIMD programming model and for which there is work imbalance: a block A will not be replaced by another block B on the same SM until the last thread of block A has finished executing.
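For contrast, the conventional, non-persistent approach is sketched below. The kernel name and the do_work helper are illustrative placeholders, not code from the referenced papers: each thread claims exactly one item based on its index and then retires, so the assignment of work to SMs is decided entirely by the hardware block scheduler.

// Conventional (non-persistent) kernel: one work item per thread.
// Each thread computes its single item and retires; the hardware scheduler
// decides when the next block is brought onto an SM.
__global__ void one_item_per_thread(const float* a, float* b, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        b[i] = do_work(a[i]);   // do_work is a placeholder for the per-item computation
}

// Launched with as many blocks as needed to cover all the items,
// typically far more blocks than the SMs can host at once:
// one_item_per_thread<<<(count + 255) / 256, 256>>>(A, B, count);

If one item in a block takes much longer than the others, the whole block keeps occupying its SM until that single slow thread finishes, which is exactly the work-imbalance problem described above.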

Although CUDA does not expose the hardware scheduler to the programmer, the persistent threads style bypasses the hardware scheduler by relying on a work queue. When a block finishes, it checks the queue for more work and continues doing so until no work is left, at which point the block retires. In this way, the kernel is launched with as many blocks as the number of available SMs.
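In practice this means the grid size is derived from the hardware rather than from the problem size. A minimal sketch of how one might do that is shown below; one resident block per SM is the simplest assumption, and real codes often tune this (e.g. with cudaOccupancyMaxActiveBlocksPerMultiprocessor). The numBlocks and blockSize values are the ones used in the launch of the example kernel further down.

// Sketch: size the persistent grid from the hardware, not from the data.
int device = 0;
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, device);

int blockSize = 256;                        // illustrative value
int numBlocks = prop.multiProcessorCount;   // one resident block per SM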

The persistent threads technique is better illustrated by the following example, which has been taken from the presentation

“GPGPU” computing and the CUDA/OpenCL Programming Model

Another more detailed example is available in the paper

Understanding the efficiency of ray traversal on GPUs

// Persistent thread: run until all the work is done, processing multiple work
// items per thread rather than just one. Terminates when no more work is available.

// count represents the number of data items to be processed

__global__ void persistent(int* ahead, int* bhead, int count, float* a, float* b)
{
    int local_input_data_index, local_output_data_index;

    // Atomically grab the next input index from the work queue; exit when
    // the queue is exhausted
    while ((local_input_data_index = read_and_increment(ahead)) < count)
    {
        load_locally(a[local_input_data_index]);

        do_work_with_locally_loaded_data();

        // Atomically reserve a slot in the output queue
        local_output_data_index = read_and_increment(bhead);

        write_result(b[local_output_data_index]);
    }
}

// Launch exactly enough threads to fill up the machine (to achieve sufficient
// parallelism and latency hiding)
persistent<<<numBlocks, blockSize>>>(ahead_addr, bhead_addr, total_count, A, B);
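The helper functions in the snippet above (read_and_increment, load_locally, do_work_with_locally_loaded_data, write_result) are placeholders from the presentation. A minimal self-contained sketch of the same pattern, assuming the work queue is just a global counter advanced with atomicAdd and the per-item work is a trivial transformation, could look like this:

// Self-contained sketch of the persistent-threads pattern: the "queue" is a
// global counter advanced with atomicAdd; the per-item work is illustrative.
__global__ void persistent_sketch(int* head, int count, const float* a, float* b)
{
    while (true)
    {
        int i = atomicAdd(head, 1);   // atomically claim the next work item
        if (i >= count)
            break;                    // queue exhausted: this thread retires

        b[i] = 2.0f * a[i];           // placeholder for the real per-item computation
    }
}

Before the launch, the counter pointed to by head has to be initialized to zero (e.g. with cudaMemset), and the kernel is launched with the hardware-sized grid discussed above.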
answered Nov 19 '22 by Vitality


Quite easy to understand. Usually each work item processes a small amount of work. If you want to save workgroup switch time, you can let one work item process a lot of work using a loop. For instance, if you have a 1920x1080 image, you can use 1920 work items and let each work item process one column of 1080 pixels in a loop.
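In CUDA terms the same idea (one thread per image column, looping over the rows of that column) might look like the sketch below; the 1920x1080 size, the row-major layout and the per-pixel operation are just assumptions for illustration:

// Sketch: one thread per column; each thread loops over the 1080 rows of its
// column, so a single work item processes a whole column instead of one pixel.
__global__ void process_columns(const float* in, float* out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width)
        return;

    for (int row = 0; row < height; ++row)
    {
        int idx = row * width + col;      // row-major indexing
        out[idx] = 0.5f * in[idx];        // placeholder per-pixel operation
    }
}

// e.g. for a 1920x1080 image:
// process_columns<<<(1920 + 127) / 128, 128>>>(d_in, d_out, 1920, 1080);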

answered Nov 19 '22 by Hunter Wang