
Persistent threads in OpenCL and CUDA

I have read some papers talking about "persistent threads" for GPGPU, but I don't really understand it. Can anyone give me an example, or show me how this programming style is used?

What I keep in mind after reading and googling "persistent threads":

Persistent threads are nothing more than a while loop that keeps a thread running and processing many pieces of work.

Is this correct? Thanks in advance.

References:
http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1089
http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0157-GTC2012-Persistent-Threads-Computing.pdf

asked Feb 11 '13 by AmineMs




2 Answers

CUDA exploits the Single Instruction Multiple Data (SIMD) programming model. The computational threads are organized in blocks, and each thread block is assigned to a Streaming Multiprocessor (SM). The execution of a thread block on an SM is performed by arranging the threads in warps of 32 threads: each warp operates in lock-step, executing exactly the same instruction on different data.

Generally, to fill up the GPU, the kernel is launched with many more blocks than can actually be hosted on the SMs. Since not all the blocks can be hosted on the SMs at once, a work scheduler performs a context switch when a block has finished computing. It should be noted that the switching of blocks is managed entirely in hardware by the scheduler, and the programmer has no means of influencing how blocks are scheduled onto the SMs. This exposes a limitation for all those algorithms that do not perfectly fit a SIMD programming model and for which there is work imbalance: a block A will not be replaced by another block B on the same SM until the last thread of block A has finished executing.
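For contrast, the conventional, non-persistent approach is sketched below. The kernel name and the do_work helper are illustrative placeholders, not code from the referenced papers: each thread claims exactly one item based on its index and then retires, so the assignment of work to SMs is decided entirely by the hardware block scheduler.

// Conventional (non-persistent) kernel: one work item per thread.
// Each thread computes its single item and retires; the hardware scheduler
// decides when the next block is brought onto an SM.
__global__ void one_item_per_thread(const float* a, float* b, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        b[i] = do_work(a[i]);   // do_work is a placeholder for the per-item computation
}

// Launched with as many blocks as needed to cover all the items,
// typically far more blocks than the SMs can host at once:
// one_item_per_thread<<<(count + 255) / 256, 256>>>(A, B, count);

If one item in a block takes much longer than the others, the whole block keeps occupying its SM until that single slow thread finishes, which is exactly the work-imbalance problem described above.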

Although CUDA does not expose the hardware scheduler to the programmer, the persistent threads style bypasses the hardware scheduler by relying on a work queue. When a block finishes, it checks the queue for more work and continues doing so until no work is left, at which point the block retires. In this way, the kernel is launched with as many blocks as the number of available SMs.
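In practice this means the grid size is derived from the hardware rather than from the problem size. A minimal sketch of how one might do that is shown below; one resident block per SM is the simplest assumption, and real codes often tune this (e.g. with cudaOccupancyMaxActiveBlocksPerMultiprocessor). The numBlocks and blockSize values are the ones used in the launch of the example kernel further down.

// Sketch: size the persistent grid from the hardware, not from the data.
int device = 0;
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, device);

int blockSize = 256;                        // illustrative value
int numBlocks = prop.multiProcessorCount;   // one resident block per SM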

The persistent threads technique is better illustrated by the following example, which has been taken from the presentation

“GPGPU” computing and the CUDA/OpenCL Programming Model

Another more detailed example is available in the paper

Understanding the efficiency of ray traversal on GPUs

// Persistent thread: run until all the work is done, processing multiple work
// items per thread rather than just one. Terminates when no more work is available.

// count represents the number of data items to be processed

__global__ void persistent(int* ahead, int* bhead, int count, float* a, float* b)
{
    int local_input_data_index, local_output_data_index;

    // Atomically grab the next input index from the work queue; exit when
    // the queue is exhausted
    while ((local_input_data_index = read_and_increment(ahead)) < count)
    {
        load_locally(a[local_input_data_index]);

        do_work_with_locally_loaded_data();

        // Atomically reserve a slot in the output queue
        local_output_data_index = read_and_increment(bhead);

        write_result(b[local_output_data_index]);
    }
}

// Launch exactly enough threads to fill up the machine (to achieve sufficient
// parallelism and latency hiding)
persistent<<<numBlocks, blockSize>>>(ahead_addr, bhead_addr, total_count, A, B);
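The helper functions in the snippet above (read_and_increment, load_locally, do_work_with_locally_loaded_data, write_result) are placeholders from the presentation. A minimal self-contained sketch of the same pattern, assuming the work queue is just a global counter advanced with atomicAdd and the per-item work is a trivial transformation, could look like this:

// Self-contained sketch of the persistent-threads pattern: the "queue" is a
// global counter advanced with atomicAdd; the per-item work is illustrative.
__global__ void persistent_sketch(int* head, int count, const float* a, float* b)
{
    while (true)
    {
        int i = atomicAdd(head, 1);   // atomically claim the next work item
        if (i >= count)
            break;                    // queue exhausted: this thread retires

        b[i] = 2.0f * a[i];           // placeholder for the real per-item computation
    }
}

Before the launch, the counter pointed to by head has to be initialized to zero (e.g. with cudaMemset), and the kernel is launched with the hardware-sized grid discussed above.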
answered Nov 19 '22 by Vitality


Quite easy to understand. Usually each work item processes a small amount of work. If you want to save workgroup switch time, you can let one work item process a lot of work using a loop. For instance, if you have a 1920x1080 image, you can use 1920 work items and let each work item process one column of 1080 pixels in a loop.
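In CUDA terms the same idea (one thread per image column, looping over the rows of that column) might look like the sketch below; the 1920x1080 size, the row-major layout and the per-pixel operation are just assumptions for illustration:

// Sketch: one thread per column; each thread loops over the 1080 rows of its
// column, so a single work item processes a whole column instead of one pixel.
__global__ void process_columns(const float* in, float* out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width)
        return;

    for (int row = 0; row < height; ++row)
    {
        int idx = row * width + col;      // row-major indexing
        out[idx] = 0.5f * in[idx];        // placeholder per-pixel operation
    }
}

// e.g. for a 1920x1080 image:
// process_columns<<<(1920 + 127) / 128, 128>>>(d_in, d_out, 1920, 1080);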

answered Nov 19 '22 by Hunter Wang