Suppose that we have an array int *data, where each thread will access one element of this array. Since this array will be shared among all threads, it will be stored in global memory.
Let's create a test kernel:
__global__ void test(int *data, int a, int b, int c){ ... }
I know for sure that the data array will be in global memory, because I allocated memory for it using cudaMalloc. As for the other variables, I've seen examples that pass an integer directly to the kernel function without allocating any memory for it. In my case such variables are a, b and c.
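To make the setup concrete, the allocation and launch I have in mind look roughly like this (N, h_data, numBlocks and threadsPerBlock are placeholder names):

    int *data;
    cudaMalloc(&data, N * sizeof(int));                                  // data lives in global memory
    cudaMemcpy(data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);
    test<<<numBlocks, threadsPerBlock>>>(data, a, b, c);                 // a, b and c are passed by value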
If I'm not mistaken, even though we do not call cudaMalloc directly to allocate 4 bytes for each of the three integers, CUDA will automatically do it for us, so in the end the variables a, b and c will be allocated in global memory.
Now, these variables are only auxiliary; the threads only read them and nothing else.
My question is: wouldn't it be better to transfer these variables to shared memory?
I imagine that if we had, for example, 10 blocks with 1024 threads each, we would need 10*3 = 30 reads of 4 bytes to store the numbers in the shared memory of each block. Without shared memory, if each thread has to read all three variables once, the total number of global memory reads would be 1024*10*3 = 30720, which is very inefficient.
Now here is the problem: I'm somewhat new to CUDA, and I'm not sure if it's possible to move the variables a, b and c into the shared memory of each block without having every thread read them from global memory and load them into shared memory. If every thread did the load, the total number of global memory reads would still be 1024*10*3 = 30720 and not 10*3 = 30.
On the following website there is this example:
__global__ void staticReverse(int *d, int n)
{
    __shared__ int s[64];   // statically sized shared memory array
    int t = threadIdx.x;
    int tr = n - t - 1;     // mirrored index
    s[t] = d[t];            // each thread stages one element into shared memory
    __syncthreads();        // wait until the whole array has been staged
    d[t] = s[tr];           // write back in reverse order
}
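For reference, a kernel like this is launched with a single block of 64 threads, along the lines of:

    const int n = 64;
    staticReverse<<<1, n>>>(d_d, n);  // d_d: a device array of n ints, allocated with cudaMalloc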
Here each thread loads a different element into the shared array s; each thread, according to its index, stages the corresponding data in shared memory. In my case, I only want to load the variables a, b and c into shared memory. These variables are always the same; they don't change, so they have nothing to do with the threads themselves. They are auxiliary, and every thread uses them to run some algorithm.
How should I approach this problem? Is it possible to achieve this with only total_amount_of_blocks*3 global memory reads?
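For illustration, the kind of staging I have in mind would look roughly like this, assuming (hypothetically) that the three values were stored in a small global array int *params instead of being passed by value:

    __global__ void test(int *data, const int *params)
    {
        __shared__ int s_params[3];
        if (threadIdx.x < 3)
            s_params[threadIdx.x] = params[threadIdx.x];  // only 3 global reads per block
        __syncthreads();

        int a = s_params[0], b = s_params[1], c = s_params[2];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = a * data[i] + b + c;  // placeholder computation using the staged values
    }

With this pattern the total number of global memory reads would be total_amount_of_blocks*3, which is what I'm after.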
Shared memory is a powerful feature for writing well-optimized CUDA code. Access to shared memory is much faster than global memory access because it is located on chip. And because shared memory is shared by the threads in a thread block, it provides a mechanism for threads to cooperate.
Size and bandwidth: per-block shared memory is faster than global memory and constant memory, but slower than per-thread registers. Each block has a maximum of 48 KB of shared memory on a K20. Per-thread registers can hold only a small amount of data, but are the fastest.
Because shared memory is on chip, its latency is roughly 100 times lower than uncached global memory latency. A bank conflict occurs if two or more threads access bytes within different 32-bit words belonging to the same bank.
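As a minimal sketch of what that means in practice (bankDemo is a made-up name, launched with one block of 64 threads): consecutive 32-bit words map to distinct banks, so a stride-1 access is conflict-free, while a stride-2 access maps two threads of a warp to each bank:

    __global__ void bankDemo(int *out)
    {
        __shared__ int s[64];
        s[threadIdx.x] = threadIdx.x;                  // stride-1: one word per bank, conflict-free
        __syncthreads();
        out[threadIdx.x] = s[(threadIdx.x * 2) % 64];  // stride-2: two-way bank conflict
    }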
The GPU runtime already does this optimally without you needing to do anything (and your assumption about how argument passing works in CUDA is incorrect). This is presently what happens: kernel arguments are passed to each block in shared memory on compute capability 1.x devices, and in dedicated constant memory banks on compute capability 2.x and newer devices.
So in your hypothetical kernel
__global__ void test(int *data, int a, int b, int c){ ... }
data, a, b, and c are all passed by value to each block in either shared memory or constant memory (depending on GPU architecture) automatically. There is no advantage in doing what you propose.
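In other words, the straightforward kernel is already optimal with respect to a, b and c; nothing has to be staged manually. A sketch (the body is illustrative):

    __global__ void test(int *data, int a, int b, int c)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // a, b and c arrive through the kernel argument space (constant or
        // shared memory depending on architecture): reading them generates
        // no global memory traffic at all.
        data[i] = a * data[i] * data[i] + b * data[i] + c;
    }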