 

cuda block synchronization

Tags:

cuda

I have b number of blocks and each block has t number of threads. I can use

 __syncthreads() 

to synchronize the threads that are in a particular block. for example

__global__ void aFunction()
{
    for (int i = 0; i < 10; i++)
    {
        // execute something
        __syncthreads();
    }
}

But my problem is to synchronize all the threads in all the blocks. How can I do this?

asked Jun 19 '11 by user570593

2 Answers

In CUDA 9, NVIDIA is introducing the concept of cooperative groups, which lets you synchronize all threads belonging to a group. Such a group can span all threads in the grid, so you can synchronize all threads in all blocks:

#include <cuda_runtime_api.h>
#include <cuda.h>
#include <cooperative_groups.h>

// inside the kernel:
cooperative_groups::grid_group g = cooperative_groups::this_grid();
g.sync();

You need a Pascal (compute capability 6.0) or newer architecture to synchronize grids. In addition, there are more specific requirements. See: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#grid-synchronization-cg

Basic functionality, such as synchronizing groups smaller than a thread block down to warp granularity, is supported on all architectures, while Pascal and Volta GPUs enable new grid-wide and multi-GPU synchronizing groups.

Source: https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
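To make the snippet above concrete, here is a minimal end-to-end sketch. The names (stepKernel, d_data) are illustrative; it assumes a compute capability 6.0+ device and that the device supports cooperative launch (check cudaDevAttrCooperativeLaunch):

#include <cstdio>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void stepKernel(float *d_data)
{
    cg::grid_group grid = cg::this_grid();
    // ... phase 1: every thread does some work ...
    grid.sync();  // no thread proceeds until all threads in all blocks reach this point
    // ... phase 2: safe to read results written by any block in phase 1 ...
}

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 1024 * sizeof(float));
    void *args[] = { &d_data };
    // Cooperative kernels must be launched with this API, not <<<...>>>
    cudaLaunchCooperativeKernel((void*)stepKernel, dim3(4), dim3(256), args);
    cudaDeviceSynchronize();
    cudaFree(d_data);
}

Note that the grid must be small enough for all blocks to be resident on the GPU at once, which is exactly the constraint the rest of this answer discusses.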


Before CUDA 9, there was no native way to synchronise all threads from all blocks. In fact, the block model in CUDA allows some blocks to be launched only after other blocks have already finished their work, for example if the GPU they run on is too small to process them all in parallel.

If you ensure that you don't spawn too many blocks, you can try to synchronise all blocks between themselves, e.g. by actively waiting with atomic operations. This is however slow, puts pressure on the GPU memory controller, is considered "a hack" and should be avoided.

So, if you don't target the Pascal (or newer) architecture, the best way that I can suggest is to simply terminate your kernel at the synchronisation point, and then launch a new kernel which continues with your job. In most circumstances it will actually perform faster (or at least with similar speed) than using the mentioned hack.
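The kernel-split approach can be sketched as follows (phase1/phase2 are placeholder names; the key fact is that kernel launches on the same stream execute in order, so the launch boundary acts as a grid-wide barrier):

__global__ void phase1(float *d_data) { /* work before the "barrier" */ }
__global__ void phase2(float *d_data) { /* work after the "barrier" */ }

// Host side: phase2 starts only after every block of phase1 has finished,
// and global memory written by phase1 is visible to phase2.
phase1<<<blocks, threads>>>(d_data);
phase2<<<blocks, threads>>>(d_data);

Anything that must survive the synchronisation point has to live in global memory, since shared memory and registers do not persist across kernel launches.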

answered Nov 13 '22 by CygnusX1


Cooperative groups have some requirements, such as needing to launch your kernel via cudaLaunchCooperativeKernel, which makes them less convenient for simple projects.

An easy alternative is using atomics with bitfields, like so:

// A global variable with 64 bits can track up to 64 blocks;
// use an array if you need to track more blocks.
__device__ unsigned long long CompleteMask;

__global__ void doWork() {
    atomicAnd(&CompleteMask, 0ULL);  // reset the mask at kernel start
    // do lots of work

    // one bit per block (assumes gridDim.x <= 63; for exactly 64 use ~0ULL)
    const unsigned long long SollMask = (1ULL << gridDim.x) - 1;
    if (threadIdx.x == 0) {
        while (atomicOr(&CompleteMask, 1ULL << blockIdx.x) != SollMask) { /* do nothing */ }
    }
    __syncthreads();  // make the remaining threads of each block wait too
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("Print a single line for the entire process\n");
    }
}

Because every block is assigned its own bit in the mask, they can never interfere. If you have more than 64 blocks, use an array of masks plus a second-level "super-mask" that tracks which words of the array are complete, like so:

// Two 32-bit words track up to 64 blocks; the super-mask tracks
// which words are complete. Extend the array for more blocks.
__device__ unsigned int CompleteMask[2];
__device__ unsigned int CompleteSuperMask;

__global__ void doWork() {
    for (int i = 0; i < 2; i++) { atomicAnd(&CompleteMask[i], 0u); }
    atomicAnd(&CompleteSuperMask, 0u);
    // do lots of work

    unsigned int SollMask[3];
    SollMask[0] = 0xFFFFFFFFu;                          // word 0 full (assumes gridDim.x > 32)
    SollMask[1] = (1u << (gridDim.x % 32)) - 1;         // partial last word (assumes gridDim.x not a multiple of 32)
    SollMask[2] = (1u << ((gridDim.x + 31) / 32)) - 1;  // one super-bit per word in use

    const unsigned int b = blockIdx.x / 32;
    if (threadIdx.x == 0) {
        while (atomicOr(&CompleteMask[b], 1u << (blockIdx.x % 32)) != SollMask[b]) { /* do nothing */ }
        while (atomicOr(&CompleteSuperMask, 1u << b) != SollMask[2]) { /* do nothing */ }
    }
    __syncthreads();
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("Print a single line for the entire process\n");
    }
}
answered Nov 13 '22 by Johan