The CUDA programming guide introduced the concept of warp vote function, "_all", "_any" and "__ballot".
My question is: what applications will use these 3 functions?
A warp is a set of 32 threads within a thread block such that all the threads in a warp execute the same instruction. These threads are selected serially by the SM. Once a thread block is launched on a multiprocessor (SM), all of its warps are resident until their execution finishes.
In an NVIDIA GPU, the basic unit of execution is the warp. A warp is a collection of threads, 32 in current implementations, that are executed simultaneously by an SM. Multiple warps can be executed on an SM at once.
The kernel is a function executed on the GPU. Every CUDA kernel starts with a __global__ declaration specifier. Programmers provide a unique global ID to each thread by using built-in variables. Figure 2. CUDA kernels are subdivided into blocks.
There are three key language extensions CUDA programmers can use—CUDA blocks, shared memory, and synchronization barriers. CUDA blocks contain a collection of threads. A block of threads can share memory, and multiple threads can pause until all threads reach a specified set of execution.
The prototype of __ballot
is the following
unsigned int __ballot(int predicate);
If predicate
is nonzero, __ballot
returns a value with the N
th bit set, where N
is the thread index.
Combined with atomicOr
and __popc
, it can be used to accumulate the number of threads in each warp having a true predicate.
Indeed, the prototype of atomicOr
is
int atomicOr(int* address, int val);
and atomicOr
reads the value pointed to by address
, performs a bitwise OR
operation with val
, and writes the value back to address
and returns its old value as a return parameter.
On the other side, __popc
returns the number of bits set withing a 32
-bit parameter.
Accordingly, the instructions
volatile __shared__ u32 warp_shared_ballot[MAX_WARPS_PER_BLOCK];
const u32 warp_sum = threadIdx.x >> 5;
atomicOr(&warp_shared_ballot[warp_num],__ballot(data[tid]>threshold));
atomicAdd(&block_shared_accumulate,__popc(warp_shared_ballot[warp_num]));
can be used to count the number of threads for which the predicate is true.
For more details, see Shane Cook, CUDA Programming, Morgan Kaufmann
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With