
About warp voting function

Tags:

cuda

gpgpu

gpu

The CUDA programming guide introduces the concept of warp vote functions: "__all", "__any" and "__ballot".

My question is: in what applications would these three functions be used?

Fan Zhang asked May 11 '12 19:05


People also ask

What is warp instruction?

A warp is a set of 32 threads within a thread block such that all the threads in a warp execute the same instruction. These threads are selected serially by the SM. Once a thread block is launched on a multiprocessor (SM), all of its warps are resident until their execution finishes.

What is warp in GPU programming?

In an NVIDIA GPU, the basic unit of execution is the warp. A warp is a collection of threads, 32 in current implementations, that are executed simultaneously by an SM. Multiple warps can be executed on an SM at once.
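As a sketch of how a thread's warp and lane are usually derived from these facts (assuming a one-dimensional block and the 32-thread warp size mentioned above; the kernel and parameter names are illustrative, not from the original page):

```cuda
// Minimal sketch: computing a thread's warp index and lane index
// inside a kernel, assuming a 1-D block and warpSize == 32.
__global__ void warp_lane_demo(int *warp_out, int *lane_out)
{
    int tid  = threadIdx.x;
    int warp = tid / 32;   // which warp within the block
    int lane = tid % 32;   // position within that warp
    warp_out[tid] = warp;
    lane_out[tid] = lane;
}
```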

What is kernel function in CUDA?

The kernel is a function executed on the GPU. Every CUDA kernel starts with a __global__ declaration specifier. Programmers provide a unique global ID to each thread by using built-in variables. Figure 2. CUDA kernels are subdivided into blocks.
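A minimal example of such a kernel, forming the usual global thread ID from the built-in variables (the names vec_add, n, and the launch configuration are assumptions for illustration):

```cuda
// Minimal __global__ kernel: each thread handles one element,
// identified by the standard global-ID expression.
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global ID
    if (i < n)                                      // guard the tail
        c[i] = a[i] + b[i];
}

// Launched e.g. as: vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
```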

What are the three general sections of a CUDA program?

There are three key language extensions CUDA programmers can use—CUDA blocks, shared memory, and synchronization barriers. CUDA blocks contain a collection of threads. A block of threads can share memory, and multiple threads can pause until all threads reach a specified point of execution.
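The three extensions can be sketched together in one kernel; this block-level reduction is an illustrative example under assumed names (block_sum, a fixed 256-thread block), not code from the page:

```cuda
// Illustrative kernel combining the three extensions: a CUDA block,
// shared memory, and the __syncthreads() barrier.
__global__ void block_sum(const int *in, int *block_sums)
{
    __shared__ int tile[256];            // shared memory: one block's tile

    int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                     // barrier: tile fully loaded

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();                 // barrier after each halving step
    }
    if (tid == 0)
        block_sums[blockIdx.x] = tile[0];
}
```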


1 Answer

The prototype of __ballot is the following

unsigned int __ballot(int predicate);

If predicate is nonzero, __ballot returns a value with the Nth bit set, where N is the lane index of the thread within its warp.
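For example, if only the even-numbered lanes of a full warp pass a nonzero predicate, the returned mask is 0x55555555. A sketch (note that on current architectures the sibling __ballot_sync(mask, predicate) is used instead; the kernel name is an assumption):

```cuda
__global__ void ballot_demo(unsigned int *out)
{
    int lane = threadIdx.x % 32;
    // Every even lane votes "true": the warp-wide result has exactly
    // those lane bits set, i.e. 0x55555555 for a full warp.
    unsigned int mask = __ballot((lane % 2) == 0);
    if (lane == 0)
        out[threadIdx.x / 32] = mask;   // one result per warp
}
```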

Combined with atomicOr and __popc, it can be used to accumulate the number of threads in each warp having a true predicate.

Indeed, the prototype of atomicOr is

int atomicOr(int* address, int val);

and atomicOr reads the word pointed to by address, performs a bitwise OR with val, writes the result back to address, and returns the old value.

On the other hand, __popc returns the number of bits set within a 32-bit parameter.

Accordingly, the instructions

volatile __shared__ u32 warp_shared_ballot[MAX_WARPS_PER_BLOCK];

const u32 warp_num = threadIdx.x >> 5;

atomicOr(&warp_shared_ballot[warp_num], __ballot(data[tid] > threshold));

atomicAdd(&block_shared_accumulate, __popc(warp_shared_ballot[warp_num]));

can be used to count the number of threads for which the predicate is true.
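Putting the pieces together, a self-contained sketch of the whole counting kernel might look like this. The initialization of the shared variables, the lane-0 guards, and the kernel name count_above are assumptions on my part; the fragment above omits them:

```cuda
typedef unsigned int u32;
#define MAX_WARPS_PER_BLOCK 8   // e.g. 256 threads / 32 per warp

// Sketch: count, per block, how many elements exceed a threshold,
// using __ballot + atomicOr per warp and __popc to sum the set bits.
__global__ void count_above(const int *data, int threshold, int *block_counts)
{
    __shared__ u32 warp_shared_ballot[MAX_WARPS_PER_BLOCK];
    __shared__ int block_shared_accumulate;

    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    u32 warp_num = threadIdx.x >> 5;   // warp index within the block
    u32 lane     = threadIdx.x & 31;   // lane index within the warp

    if (threadIdx.x == 0) block_shared_accumulate = 0;
    if (lane == 0)        warp_shared_ballot[warp_num] = 0;
    __syncthreads();

    // Each warp ORs in its ballot mask: one bit per thread whose
    // predicate is true.
    atomicOr(&warp_shared_ballot[warp_num], __ballot(data[tid] > threshold));
    __syncthreads();

    // One thread per warp adds its warp's popcount to the block total.
    if (lane == 0)
        atomicAdd(&block_shared_accumulate,
                  __popc(warp_shared_ballot[warp_num]));
    __syncthreads();

    if (threadIdx.x == 0)
        block_counts[blockIdx.x] = block_shared_accumulate;
}
```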

For more details, see Shane Cook, CUDA Programming, Morgan Kaufmann

Vitality answered Jan 01 '23 17:01