Inter-block barrier on CUDA

Tags: c, cuda, gpgpu, nvidia

I want to implement an inter-block barrier on CUDA, but I have run into a serious problem.

I cannot figure out why it does not work.

#include <iostream>
#include <cstdlib>
#include <ctime>

#define SIZE 10000000
#define BLOCKS 100 

using namespace std;

struct Barrier {
    int *count;

    __device__ void wait() {
        atomicSub(count, 1);
        while(*count)
            ;
    }

    Barrier() {
        int blocks = BLOCKS;
        cudaMalloc((void**) &count, sizeof(int));
        cudaMemcpy(count, &blocks, sizeof(int), cudaMemcpyHostToDevice);
    }

    ~Barrier() {
        cudaFree(count);
    }
};


__global__ void sum(int* vec, int* cache, int *sum, Barrier barrier)
{
    int tid = blockIdx.x;

    int temp = 0;
    while(tid < SIZE) {
        temp += vec[tid];
        tid += gridDim.x;
    }

    cache[blockIdx.x] = temp;

    barrier.wait();

    if(blockIdx.x == 0) {
        for(int i = 0 ; i < BLOCKS; ++i)
            *sum += cache[i];
    }
}

int main()
{
    int* vec_host = (int *) malloc(SIZE * sizeof(int));    
    for(int i = 0; i < SIZE; ++i)
        vec_host[i] = 1;

    int *vec_dev;
    int *sum_dev;
    int *cache;
    int sum_gpu = 0;

    cudaMalloc((void**) &vec_dev, SIZE * sizeof(int));
    cudaMemcpy(vec_dev, vec_host, SIZE * sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void**) &sum_dev, sizeof(int));
    cudaMemcpy(sum_dev, &sum_gpu, sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void**) &cache, BLOCKS * sizeof(int));
    cudaMemset(cache, 0, BLOCKS * sizeof(int));

    Barrier barrier;
    sum<<<BLOCKS, 1>>>(vec_dev, cache, sum_dev, barrier);

    cudaMemcpy(&sum_gpu, sum_dev, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(vec_dev);
    cudaFree(sum_dev);
    cudaFree(cache);
    free(vec_host);
    return 0;
}

In fact, even if I rewrite wait() as follows:

    __device__ void wait() {
        while(*count != 234124)
            ;
    }

the program still exits normally, but I would expect an infinite loop in this case.

asked Oct 09 '11 by Zifei Tong


1 Answer

Unfortunately, what you want to achieve (inter-block communication/synchronization) isn't strictly possible in CUDA. The CUDA programming guide states that "thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series." The reason for this restriction is to allow flexibility in the thread block scheduler, and to allow the code to agnostically scale with the number of cores. The only supported inter-block synchronization method is to launch another kernel: kernel launches (within the same stream) are implicit synchronization points.
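
As an illustration of that last point, here is a minimal sketch (reusing the SIZE and BLOCKS constants from your listing) of how the reduction can be split across two kernel launches, with the launch boundary itself acting as the barrier:

// First pass: each block writes its own partial sum; no block waits for another.
__global__ void partial_sums(const int* vec, int* cache)
{
    int tid = blockIdx.x;
    int temp = 0;
    while (tid < SIZE) {          // grid-stride loop over the input
        temp += vec[tid];
        tid += gridDim.x;
    }
    cache[blockIdx.x] = temp;
}

// Second pass: a single thread folds the partial sums into the final result.
__global__ void final_sum(const int* cache, int* sum)
{
    int total = 0;
    for (int i = 0; i < BLOCKS; ++i)
        total += cache[i];
    *sum = total;
}

// Host side: two launches in the same (default) stream. The second kernel
// cannot start until the first has completed, so no explicit barrier is needed.
//   partial_sums<<<BLOCKS, 1>>>(vec_dev, cache);
//   final_sum<<<1, 1>>>(cache, sum_dev);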

Your code violates the block independence rule because it implicitly assumes that your kernel's thread blocks execute concurrently (cf. in parallel). But there's no guarantee that they do. To see why this matters to your code, let's consider a hypothetical GPU with only one core. We'll also assume that you only want to launch two thread blocks. Your spinloop kernel will actually deadlock in this situation. If thread block zero is scheduled on the core first, it will loop forever when it gets to the barrier, because thread block one never has a chance to update the counter. Because thread block zero is never swapped out (thread blocks execute to their completion) it starves thread block one of the core while it spins.

Some folks have tried schemes such as yours and have seen success because the scheduler happened to serendipitously schedule blocks in such a way that the assumptions worked out. For example, there was a time when launching as many thread blocks as a GPU has SMs meant that the blocks were truly executed concurrently. But they were disappointed when a change to the driver or CUDA runtime or GPU invalidated that assumption, breaking their code.

For your application, try to find a solution that doesn't depend on inter-block synchronization, because (barring a significant change to the CUDA programming model) it just isn't possible.
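
For example, one way to compute the sum above with no inter-block barrier at all (a sketch, not necessarily the fastest reduction) is to let each block commit its partial result with an atomic add. This assumes sum has been zero-initialized on the device, as it already is in your main:

__global__ void sum_atomic(const int* vec, int* sum)
{
    int tid = blockIdx.x;
    int temp = 0;
    while (tid < SIZE) {          // grid-stride loop, as in the original kernel
        temp += vec[tid];
        tid += gridDim.x;
    }
    atomicAdd(sum, temp);         // no block ever waits for another
}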

answered Nov 15 '22 by Jared Hoberock