 

Why should I use CUDA __shared__ memory as "extern"

Tags:

c

cuda

An example of how to use dynamically allocated (and therefore extern) shared memory in CUDA is commonly given as follows: a single dynamic shared memory allocation is carved up into several differently-typed arrays.

    extern __shared__ float array[];    // size supplied at kernel launch
    __device__ void func()      // __device__ or __global__ function
    {
        short* array0 = (short*)array;          // 128 shorts at the start
        float* array1 = (float*)&array0[128];   // 64 floats after array0
        int*   array2 =   (int*)&array1[64];    // ints after array1
    }
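One point the snippet leaves implicit: with extern shared memory, the total size is not declared anywhere in the device code; it must be supplied as the third execution-configuration parameter of the kernel launch. A sketch of such a launch (the kernel name `someKernel` and the element count for `array2` are assumptions for illustration, not from the original):

    // Total size of the single dynamic allocation that func() carves up.
    size_t nbytes = 128 * sizeof(short)    // array0: 128 shorts
                  +  64 * sizeof(float)    // array1: 64 floats
                  +  32 * sizeof(int);     // array2: assumed 32 ints

    // Hypothetical kernel that calls func(); the third launch parameter
    // sets the dynamic shared memory size per block at runtime.
    someKernel<<<gridDim, blockDim, nbytes>>>();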

But why should I ever use extern, dynamically allocated shared memory when I have to partition it among variables manually anyway?

I do not see any drawback to the following solution:

    __device__ void func()      // __device__ or __global__ function
    {
        __shared__ float array[MAXIMALLY_NEEDED_SIZE];
        short* array0 = (short*)array;
        float* array1 = (float*)&array0[128];
        int*   array2 =   (int*)&array1[64];
    }

Clearly, with the first solution, I can save some shared memory. But how would this ever help me?

(I guess there's a good reason for dynamically allocated memory, but I don't see it, so I probably have a gap in my understanding. That's why I'm asking.)

Michael asked Mar 03 '15 12:03


1 Answer

The reason to use dynamically allocated shared memory (as opposed to statically allocated) is similar to one reason why you might want to allocate anything dynamically instead of statically: at compile-time, you don't know the size of the allocation you will want.

The example you've given doesn't illustrate this point very well. The original purpose of that example was to illustrate how to handle multiple independent objects residing in shared memory in the dynamically allocated case, not to highlight the uses of dynamic vs. static shared memory.

Clearly, with the first solution, I can save some shared memory. But how would this ever help me?

One possible reason saving shared memory might be valuable is that it can impact occupancy, and thus performance.

Suppose I had a parallel reduction code, and suppose that it used shared memory as the primary reduction medium. Typically the amount of shared memory I will need will be related to the number of threads I use in my threadblock. Now let's also suppose that depending on the exact problem I have, I may want to adjust the number of threads per threadblock, at runtime.

If I launch threadblocks of 256 threads, and I am doing a parallel reduction on 64-bit integers, I may need 256 * 8 bytes (2KB) of shared memory per threadblock. If I launch threadblocks of 1024 threads, I would need 8KB of shared memory per threadblock (the maximum this kernel could conceivably need).
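The reduction described above might be sketched as follows. This is a minimal illustration, not the answerer's actual code; the kernel assumes the input length is a multiple of the block size and that the block size is a power of two:

    // Shared-memory reduction over 64-bit integers. Because the shared
    // array is extern, one compiled kernel serves any block size.
    __global__ void reduce(const long long *in, long long *out)
    {
        extern __shared__ long long sdata[];   // sized at launch, not compile time
        unsigned int tid = threadIdx.x;
        sdata[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();

        // Tree reduction within the block.
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            out[blockIdx.x] = sdata[0];
    }

    // Host side: the same kernel, launched with different block sizes and
    // correspondingly different dynamic shared memory sizes.
    // reduce<<<grid, 256,  256  * sizeof(long long)>>>(d_in, d_out);  // 2KB/block
    // reduce<<<grid, 1024, 1024 * sizeof(long long)>>>(d_in, d_out);  // 8KB/block

With a static `__shared__ long long sdata[1024];` declaration instead, every launch would pay the full 8KB regardless of block size.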

If I simply hard-coded this value, so that it could be used at compile-time as part of a static allocation, I would need to use the 8KB value. This would limit me to a maximum occupancy of 6 threadblocks per multiprocessor on most GPUs (6 * 8KB = 48KB maximum shared memory), even if I were launching threadblocks of only 256 threads. (And if I needed any shared memory for any other purpose, then my max occupancy would be less than 6 threadblocks.)

With dynamic allocation, threadblocks of 1024 threads still face the same limit as above, but threadblocks launched with 256 threads need only 2KB each, so up to 24 blocks' worth of shared memory fits in the same 48KB. They can therefore achieve theoretically higher occupancy (at least as far as the shared memory limit is concerned), which could translate to higher performance.

Robert Crovella answered Oct 18 '22 04:10