I have read some papers talking about "persistent threads" for GPGPU, but I don't really understand it. Can anyone give me an example or show me the use of this programming style?
What I have in mind after reading and googling "persistent threads":
Persistent threads is nothing more than a while loop that keeps a thread running and processing many units of work.
Is this correct? Thanks in advance.
References:
http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1089
http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0157-GTC2012-Persistent-Threads-Computing.pdf
The Persistent Threads style of programming alters the notion of the lifetime of virtual software threads, bringing them closer to the execution lifetime of physical hardware threads, i.e. the developer's view is that threads are active for the entire duration of a kernel.
CUDA exploits the Single Instruction Multiple Data (SIMD) programming model (NVIDIA's variant is called SIMT, Single Instruction Multiple Threads). The computational threads are organized in blocks, and each thread block is assigned to a Streaming Multiprocessor (SM). To execute a thread block on an SM, the threads are arranged in warps of 32 threads: each warp operates in lock-step and executes exactly the same instruction on different data.
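As a concrete sketch of this conventional model (the kernel name and the per-element operation are illustrative, not from the question), each thread derives a global index from its block and thread coordinates and processes exactly one element:

// Conventional (non-persistent) kernel: one element per thread.
// The hardware groups the threads of each block into warps of 32
// that execute in lock-step.
__global__ void scale(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against the tail
        out[i] = 2.0f * in[i];                      // placeholder per-element work
}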
Generally, to fill up the GPU, a kernel is launched with many more blocks than can actually be hosted on the SMs at once. Since not all the blocks can be resident on an SM simultaneously, a work scheduler performs a context switch when a block has finished computing. Note that the swapping of blocks is managed entirely in hardware by the scheduler, and the programmer has no means of influencing how blocks are scheduled onto the SMs. This is a limitation for all those algorithms that do not perfectly fit a SIMD programming model and for which there is work imbalance: a block A will not be replaced by another block B on the same SM until the last thread of block A has finished executing.
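To see the scale of that oversubscription, consider an example launch of the kernel sketched above (the sizes are illustrative): for a large input the grid contains tens of thousands of blocks, while a GPU has only a few dozen SMs, so most blocks sit waiting for the hardware scheduler to swap them in.

int n = 1 << 24;                                 // e.g. ~16M elements
int blockSize = 256;
int gridSize = (n + blockSize - 1) / blockSize;  // 65536 blocks, far more than the SM count
scale<<<gridSize, blockSize>>>(d_in, d_out, n);  // d_in/d_out: device buffers (assumed allocated)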
Although CUDA does not expose the hardware scheduler to the programmer, the persistent threads style bypasses the hardware scheduler by relying on a work queue. When a block finishes, it checks the queue for more work and continues doing so until no work is left, at which point the block retires. In this way, the kernel is launched with as many blocks as the number of available SMs.
The persistent threads technique is better illustrated by the following example, taken from the presentation "GPGPU" computing and the CUDA/OpenCL Programming Model. Another, more detailed example is available in the paper Understanding the efficiency of ray traversal on GPUs.
// Persistent-thread kernel: runs until the work queue is drained, with each
// thread processing many work items rather than just one. It terminates when
// no more work is available.
// ahead/bhead are global queue heads (assumed initialized to 0 on the host);
// count is the total number of items to be processed.
__global__ void persistent(int* ahead, int* bhead, int count, float* a, float* b)
{
    int local_input_data_index, local_output_data_index;
    // Atomically claim the next input item; exit once the queue is exhausted.
    while ((local_input_data_index = atomicAdd(ahead, 1)) < count)
    {
        load_locally(a[local_input_data_index]);   // fetch the claimed item
        do_work_with_locally_loaded_data();        // placeholder for the real work
        // Atomically claim a slot in the output queue and write the result.
        local_output_data_index = atomicAdd(bhead, 1);
        write_result(b[local_output_data_index]);
    }
}

// Launch exactly enough threads to fill up the machine (to achieve sufficient
// parallelism and latency hiding), not one thread per work item.
persistent<<<numBlocks, blockSize>>>(ahead_addr, bhead_addr, total_count, A, B);
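A practical way to choose numBlocks for such a launch, assuming the CUDA occupancy API (available since CUDA 6.5) and illustrative variable names, is to multiply the number of SMs by the number of resident blocks the kernel allows per SM:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                 // query device 0

int blocksPerSM = 0;
// How many blocks of this kernel can be resident on one SM at this block size?
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, persistent, blockSize, 0);

int numBlocks = prop.multiProcessorCount * blocksPerSM;  // fill the machine exactly once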
Quite easy to understand. Usually each work item processes only a small amount of work. If you want to save work-group switching time, you can let one work item process a lot of work using a loop. For instance, if you have a 1920x1080 image, you can launch 1920 work items, each of which processes one column of 1080 pixels in a loop, as sketched below.
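A minimal sketch of that column-per-work-item idea, written in CUDA to match the example above (the kernel name and the per-pixel operation are made up for illustration):

// One thread per image column; each thread loops over all rows of its column.
__global__ void process_columns(const float* in, float* out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;
    for (int row = 0; row < height; ++row) {       // walk down the column
        int idx = row * width + col;               // row-major pixel index
        out[idx] = in[idx] * 0.5f;                 // placeholder per-pixel work
    }
}

// Launch enough threads to cover the 1920 columns:
// process_columns<<<(1920 + 255) / 256, 256>>>(d_in, d_out, 1920, 1080);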