Is it worthwhile to pass kernel parameters via shared memory?

Suppose that we have an array int *data, and each thread will access one element of it. Since this array is shared among all threads, it will live in global memory.

Let's create a test kernel:

 __global__ void test(int *data, int a, int b, int c){ ... }

I know for sure that the data array will be in global memory, because I allocated it with cudaMalloc. As for the other variables, I've seen examples that pass an integer directly to the kernel function without allocating any memory for it. In my case such variables are a, b and c.
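
For concreteness, this is roughly what I am doing now (the kernel body, values and launch configuration below are placeholders I made up):

 #include <cuda_runtime.h>

 __global__ void test(int *data, int a, int b, int c)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     data[i] += a * b + c;               // every thread reads a, b and c
 }

 int main()
 {
     const int numBlocks = 10, threadsPerBlock = 1024;
     const int n = numBlocks * threadsPerBlock;

     int *data;
     cudaMalloc(&data, n * sizeof(int)); // data lives in global memory
     cudaMemset(data, 0, n * sizeof(int));

     // a, b and c are passed by value, with no explicit allocation anywhere
     test<<<numBlocks, threadsPerBlock>>>(data, 1, 2, 3);
     cudaDeviceSynchronize();

     cudaFree(data);
     return 0;
 }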

If I'm not mistaken, even though we do not call cudaMalloc directly to allocate 4 bytes for each of the three integers, CUDA will automatically do it for us, so in the end the variables a, b and c will be allocated in global memory.

Now, these variables are only auxiliary: the threads read them and do nothing else with them.

My question is: wouldn't it be better to transfer these variables to shared memory?

I imagine that if we had, for example, 10 blocks of 1024 threads each, we would need 10*3 = 30 reads of 4 bytes to store the numbers in the shared memory of each block.

Without shared memory, if each thread has to read all three variables once, the total number of global memory reads would be 1024*10*3 = 30720, which seems very inefficient.

Now here is the problem: I'm somewhat new to CUDA, and I'm not sure whether it's possible to move the variables a, b and c into the shared memory of each block without having every single thread read them from global memory first. If every thread did the loading, the total would again be 1024*10*3 = 30720 global memory reads instead of 10*3 = 30.
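
The closest pattern I can think of looks like this, assuming the three values were first placed in a small device array (d_params is a hypothetical name, not something from my real code):

 __global__ void testShared(int *data, const int *d_params)
 {
     __shared__ int s_params[3];

     // Only thread 0 of each block touches global memory,
     // so with 10 blocks this would be 10*3 = 30 reads in total.
     if (threadIdx.x == 0) {
         s_params[0] = d_params[0];      // a
         s_params[1] = d_params[1];      // b
         s_params[2] = d_params[2];      // c
     }
     __syncthreads();                    // make the values visible to the whole block

     int a = s_params[0], b = s_params[1], c = s_params[2];
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     data[i] += a * b + c;               // placeholder use of the values
 }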

On the following website there is this example:

 __global__ void staticReverse(int *d, int n)
 {
     __shared__ int s[64];              // one element per thread in the block
     int t  = threadIdx.x;
     int tr = n - t - 1;                // mirrored index
     s[t] = d[t];                       // each thread stages one element
     __syncthreads();                   // wait until the whole array is staged
     d[t] = s[tr];                      // write back in reverse order
 }

Here each thread loads a different element into the shared array s: each thread, according to its index, copies its own element into shared memory.
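
For reference, my understanding is that this kernel is meant to be launched with a single block of 64 threads, so that n matches the size of the shared array; a rough host-side sketch (my own, not from the site):

 int main()
 {
     const int n = 64;
     int a[n], d_host[n];
     for (int i = 0; i < n; i++) a[i] = i;

     int *d_d;
     cudaMalloc(&d_d, n * sizeof(int));
     cudaMemcpy(d_d, a, n * sizeof(int), cudaMemcpyHostToDevice);

     staticReverse<<<1, n>>>(d_d, n);   // one block of 64 threads

     cudaMemcpy(d_host, d_d, n * sizeof(int), cudaMemcpyDeviceToHost);
     cudaFree(d_d);
     return 0;
 }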

In my case, I want to load only the variables a, b and c into shared memory. These variables are always the same; they never change, so they have nothing to do with the thread indices themselves. They are auxiliary values used by every thread to run some algorithm.

How should I approach this problem? Is it possible to achieve this by only doing total_amount_of_blocks*3 global memory reads?

asked May 25 '13 by ksm001


1 Answer

The GPU runtime already does this optimally without you needing to do anything (and your assumption about how argument passing works in CUDA is incorrect). This is presently what happens:

  • In compute capability 1.0/1.1/1.2/1.3 devices, kernel arguments are passed by the runtime in shared memory.
  • In compute capability 2.x and newer devices (3.x, 5.x, 6.x; there is no 4.x), kernel arguments are passed by the runtime in a reserved constant memory bank (which has a dedicated cache with broadcast).

So in your hypothetical kernel

__global__ void test(int *data, int a, int b, int c){ ... }

data, a, b, and c are all passed by value to each block in either shared memory or constant memory (depending on GPU architecture) automatically. There is no advantage in doing what you propose.
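
If you ever do want explicit control over where such read-only values live, __constant__ memory gives you the same broadcast-cached access pattern the runtime already uses for kernel arguments. A minimal sketch (the names here are mine), although for three scalars it buys you nothing over plain arguments:

 __constant__ int c_params[3];          // a, b and c

 __global__ void test(int *data)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     data[i] += c_params[0] * c_params[1] + c_params[2];
 }

 // host side, before the launch:
 // int h_params[3] = {1, 2, 3};
 // cudaMemcpyToSymbol(c_params, h_params, sizeof(h_params));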

answered Sep 19 '22 by talonmies