
CUDA Block and Grid size efficiencies


What is the advised way of dealing with dynamically-sized datasets in cuda?

Is it a case of 'set the block and grid sizes based on the problem set' or is it worthwhile to assign block dimensions as factors of 2 and have some in-kernel logic to deal with the over-spill?

I can see how this probably matters a lot for the block dimensions, but how much does it matter for the grid dimensions? As I understand it, the actual hardware constraints stop at the block level (i.e. blocks are assigned to SMs that have a set number of SPs, and so can handle a fixed warp size).

I've perused Kirk's 'Programming Massively Parallel Processors' but it doesn't really touch on this area.

Bolster asked Apr 27 '11 20:04


2 Answers

It's usually a case of setting the block size for optimal performance, and the grid size according to the total amount of work. Most kernels have a "sweet spot" number of warps per MP where they work best, and you should do some benchmarking/profiling to see where that is. You probably still need over-spill logic in the kernel because problem sizes are rarely round multiples of block sizes.

EDIT: To give a concrete example of how this might be done for a simple kernel (in this case a custom BLAS level 1 dscal type operation done as part of a Cholesky factorization of packed symmetric band matrices):

// Fused square root and dscal operation
__global__ 
void cdivkernel(const int n, double *a)
{
    __shared__ double oneondiagv;

    int imin = threadIdx.x + blockDim.x * blockIdx.x;
    int istride = blockDim.x * gridDim.x;

    if (threadIdx.x == 0) {
        oneondiagv = rsqrt( a[0] );
    }
    __syncthreads();

    for(int i=imin; i<n; i+=istride) {
        a[i] *= oneondiagv;
    }
}

To launch this kernel, the execution parameters are calculated as follows:

  1. We allow up to 4 warps per block (so 128 threads). Normally you would fix this at an optimal number, but in this case the kernel is often called on very small vectors, so having a variable block size made some sense.
  2. We then compute the block count according to the total amount of work, up to 112 total blocks, which is the equivalent of 8 blocks per MP on a 14 MP Fermi Tesla. The kernel will iterate if the amount of work exceeds the grid size.

The resulting wrapper function containing the execution parameter calculations and kernel launch looks like this:

// Fused the diagonal element root and dscal operation into
// a single "cdiv" operation
void fusedDscal(const int n, double *a)
{
    // The semibandwidth (column length) determines
    // how many warps are required per column of the 
    // matrix.
    const int warpSize = 32;
    const int maxGridSize = 112; // this is 8 blocks per MP for a Tesla C2050

    int warpCount = (n / warpSize) + (((n % warpSize) == 0) ? 0 : 1);
    int warpPerBlock = max(1, min(4, warpCount));

    // For the cdiv kernel, the block size is allowed to grow to
    // four warps per block, and the block count becomes the warp count over four
    // or the GPU "fill" whichever is smaller
    int threadCount = warpSize * warpPerBlock;
    int blockCount = min( maxGridSize, max(1, warpCount/warpPerBlock) );
    dim3 BlockDim = dim3(threadCount, 1, 1);
    dim3 GridDim  = dim3(blockCount, 1, 1);

    cdivkernel<<< GridDim,BlockDim >>>(n,a);
    errchk( cudaPeekAtLastError() );
}

Perhaps this gives some hints about how to design a "universal" scheme for setting execution parameters against input data size.

talonmies answered Oct 05 '22 08:10


OK, I guess we are dealing with two questions here.

1) Good way to assign block sizes (i.e. the number of threads). This usually depends on the kind of data you are dealing with. Are you dealing with vectors? With matrices? The suggested way is to keep the number of threads a multiple of 32. So when dealing with vectors, launching 256 x 1 or 512 x 1 blocks may be fine. Similarly, when dealing with matrices, 32 x 8 or 32 x 16.

2) Good way to assign grid sizes (i.e. the number of blocks). It gets a bit tricky here. Just launching 10,000 blocks because we can is not normally the best way to do things. Switching blocks in and out of the hardware is costly. Two things to consider are the shared memory used per block and the total number of SPs available; solve for the optimal number.

You can find a really good implementation of how to do that in Thrust. It may take a while to figure out what's happening inside the code, though.

Pavan Yalamanchili answered Oct 05 '22 08:10