Suppose that we have an array int *data, where each thread will access one element of this array. Since this array will be shared among all threads, it will be stored in global memory.
Let's create a test kernel:
__global__ void test(int *data, int a, int b, int c){ ... }
I know for sure that the data array will be in global memory, because I allocated memory for it using cudaMalloc. As for the other variables, I've seen examples that pass an integer directly to the kernel function without allocating any memory for it. In my case such variables are a, b and c.
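To make the setup concrete, the allocation and launch I have in mind look roughly like this (N, h_data, numBlocks and threadsPerBlock are placeholder names):

    int *data;
    cudaMalloc(&data, N * sizeof(int));                                  // data lives in global memory
    cudaMemcpy(data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);
    test<<<numBlocks, threadsPerBlock>>>(data, a, b, c);                 // a, b and c are passed by value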
If I'm not mistaken, even though we do not call cudaMalloc directly to allocate 4 bytes for each of the three integers, CUDA will automatically do it for us, so in the end the variables a, b and c will be allocated in global memory.
Now, these variables are only auxiliary; the threads only read them and nothing else.
My question is: wouldn't it be better to transfer these variables to shared memory?
I imagine that if we had, for example, 10 blocks with 1024 threads each, we would need 10*3 = 30 reads of 4 bytes to store the numbers in the shared memory of each block. Without shared memory, if each thread has to read all three variables once, the total number of global memory reads would be 1024*10*3 = 30720, which is very inefficient.
Now here is the problem: I'm somewhat new to CUDA, and I'm not sure if it's possible to move the variables a, b and c into the shared memory of each block without having every thread read them from global memory and load them into shared memory. If every thread did the load, the total number of global memory reads would still be 1024*10*3 = 30720 and not 10*3 = 30.
On the following website there is this example:
__global__ void staticReverse(int *d, int n)
{
    __shared__ int s[64];   // statically sized shared memory array
    int t = threadIdx.x;
    int tr = n - t - 1;     // mirrored index
    s[t] = d[t];            // each thread stages one element into shared memory
    __syncthreads();        // wait until the whole array has been staged
    d[t] = s[tr];           // write back in reverse order
}
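For reference, a kernel like this is launched with a single block of 64 threads, along the lines of:

    const int n = 64;
    staticReverse<<<1, n>>>(d_d, n);  // d_d: a device array of n ints, allocated with cudaMalloc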
Here each thread loads a different element into the shared array s; each thread, according to its index, stages the corresponding data in shared memory. In my case, I only want to load the variables a, b and c into shared memory. These variables are always the same; they don't change, so they have nothing to do with the threads themselves. They are auxiliary, and every thread uses them to run some algorithm.
How should I approach this problem? Is it possible to achieve this with only total_amount_of_blocks*3 global memory reads?
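For illustration, the kind of staging I have in mind would look roughly like this, assuming (hypothetically) that the three values were stored in a small global array int *params instead of being passed by value:

    __global__ void test(int *data, const int *params)
    {
        __shared__ int s_params[3];
        if (threadIdx.x < 3)
            s_params[threadIdx.x] = params[threadIdx.x];  // only 3 global reads per block
        __syncthreads();

        int a = s_params[0], b = s_params[1], c = s_params[2];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = a * data[i] + b + c;  // placeholder computation using the staged values
    }

With this pattern the total number of global memory reads would be total_amount_of_blocks*3, which is what I'm after.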
Shared memory is a powerful feature for writing well-optimized CUDA code. Access to shared memory is much faster than global memory access because it is located on chip. And because shared memory is shared by the threads in a thread block, it provides a mechanism for threads to cooperate.
Size and bandwidth: per-block shared memory is faster than global memory and constant memory, but slower than per-thread registers. Each block has a maximum of 48 KB of shared memory on a K20. Per-thread registers can hold only a small amount of data, but are the fastest.
Because shared memory is on chip, its latency is roughly 100 times lower than uncached global memory latency. A bank conflict occurs if two or more threads access bytes within different 32-bit words belonging to the same bank.
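As a minimal sketch of what that means in practice (bankDemo is a made-up name, launched with one block of 64 threads): consecutive 32-bit words map to distinct banks, so a stride-1 access is conflict-free, while a stride-2 access maps two threads of a warp to each bank:

    __global__ void bankDemo(int *out)
    {
        __shared__ int s[64];
        s[threadIdx.x] = threadIdx.x;                  // stride-1: one word per bank, conflict-free
        __syncthreads();
        out[threadIdx.x] = s[(threadIdx.x * 2) % 64];  // stride-2: two-way bank conflict
    }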
The GPU runtime already does this optimally without you needing to do anything (and your assumption about how argument passing works in CUDA is incorrect). This is presently what happens: kernel arguments are passed to each block in shared memory on compute capability 1.x devices, and in dedicated constant memory banks on compute capability 2.x and newer devices.
So in your hypothetical kernel
__global__ void test(int *data, int a, int b, int c){ ... }
data, a, b, and c are all passed by value to each block in either shared memory or constant memory (depending on GPU architecture) automatically. There is no advantage in doing what you propose.
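In other words, the straightforward kernel is already optimal with respect to a, b and c; nothing has to be staged manually. A sketch (the body is illustrative):

    __global__ void test(int *data, int a, int b, int c)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // a, b and c arrive through the kernel argument space (constant or
        // shared memory depending on architecture): reading them generates
        // no global memory traffic at all.
        data[i] = a * data[i] * data[i] + b * data[i] + c;
    }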