I’m getting confused about how to use shared and global memory in CUDA, especially with respect to the following:
When we use cudaMalloc(), do we get a pointer to shared or global memory?
Is storing a variable in shared memory the same as passing its address via the kernel? I.e. instead of having
__global__ void kernel() {
   __shared__ int i;
   foo(i);
}
why not equivalently do
__global__ void kernel(int *i_ptr) {
   foo(*i_ptr);
}
int main() {
   int *i_ptr;
   cudaMalloc(&i_ptr, sizeof(int));
   kernel<<<blocks,threads>>>(i_ptr);
}
There've been many questions about specific speed issues in global vs shared memory, but none encompassing an overview of when to use either one in practice.
Many thanks
There are several kinds of memory on a CUDA device, each with different scope, lifetime, and caching behavior. So far in this series we have used global memory, which resides in device DRAM, for transfers between the host and device as well as for the data input to and output from kernels.
In general computer science usage, shared memory is memory that may be simultaneously accessed by multiple programs, with the intent of providing communication among them or avoiding redundant copies. In that sense, shared memory is an efficient means of passing data between programs. This is a different concept from CUDA's on-chip __shared__ memory.
Shared system memory of this kind is what integrated graphics (e.g. the Intel HD series) typically use. It is not on your NVIDIA GPU, and CUDA can't use it.
The bandwidth of shared memory is 32 bits per bank per clock cycle. Because shared memory is on chip, its latency is roughly 100 times lower than that of uncached global memory.
When we use cudaMalloc()
In order to store data on the GPU that can be communicated back to the host, we need allocated memory that lives until it is freed. Think of global memory as heap space: it lives until it is freed or the application closes, and it is visible to any thread in any block that has a pointer to that memory region. Shared memory can be thought of as stack space that lives until a block of a kernel finishes, and its visibility is limited to threads within the same block. So cudaMalloc is used to allocate space in global memory.
Do we get a pointer to shared or global memory?
You will get a pointer to a memory address residing in the global memory.
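To make the cudaMalloc() answer concrete, here is a minimal sketch of a global memory round trip. The kernel name fill_indices and the helper run_fill are made up for illustration; the point is only that the pointer returned by cudaMalloc() refers to global memory that outlives the kernel and can be copied back to the host.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes its global index into global memory.
__global__ void fill_indices(int *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = idx;
}

// Hypothetical helper: returns the sum of the values copied back, or -1 on error.
long long run_fill(int n) {
    int *d_out = nullptr;  // will point into device global memory
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return -1;

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    fill_indices<<<blocks, threads>>>(d_out, n);

    // The allocation outlives the kernel, so the host can read it afterwards.
    std::vector<int> h_out(n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);  // global memory lives until it is freed explicitly
    if (err != cudaSuccess) return -1;

    long long sum = 0;
    for (int v : h_out) sum += v;
    return sum;
}

int main() {
    printf("sum = %lld\n", run_fill(1000));
    return 0;
}
```

Note that the host never dereferences d_out directly; it only passes it to the kernel and to cudaMemcpy().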
Does global memory reside on the host or device?
Global memory resides on the device. However, there are ways to use host memory as "global" memory using mapped memory (see: CUDA Zero Copy memory considerations). This may be slow, though, because of bus transfer speed limitations.
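As a rough sketch of the mapped (zero-copy) approach, assuming a device that supports mapping host memory: the kernel add_one and the helper run_zero_copy are hypothetical names, while cudaHostAlloc, cudaHostGetDevicePointer and cudaFreeHost are the runtime calls typically used for this.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: increments every element in place.
__global__ void add_one(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1;
}

// Hypothetical helper: returns 1 if the kernel updated host memory in place,
// 0 if the data is wrong, -1 on a CUDA error.
int run_zero_copy(int n) {
    // May be unnecessary on recent toolkits; the return value is ignored here.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int *h_data = nullptr;
    // Page-locked host memory that is also mapped into the device address space.
    if (cudaHostAlloc((void **)&h_data, n * sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i) h_data[i] = i;

    int *d_alias = nullptr;  // device-side pointer to the same host memory
    if (cudaHostGetDevicePointer((void **)&d_alias, h_data, 0) != cudaSuccess) {
        cudaFreeHost(h_data);
        return -1;
    }

    int threads = 128;
    add_one<<<(n + threads - 1) / threads, threads>>>(d_alias, n);
    // No cudaMemcpy: every access travels over the bus, hence the speed caveat.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFreeHost(h_data);
        return -1;
    }

    int ok = 1;
    for (int i = 0; i < n; ++i)
        if (h_data[i] != i + 1) ok = 0;
    cudaFreeHost(h_data);
    return ok;
}

int main() {
    printf("zero-copy ok: %d\n", run_zero_copy(1024));
    return 0;
}
```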
Is there a size limit to either one?
The size of global memory varies from card to card, anything from none to 32GB (V100), while the size of shared memory depends on the compute capability. Anything below compute capability 2.x has a maximum of 16KB of shared memory per multiprocessor (where the number of multiprocessors varies from card to card), and cards with compute capability 2.x and greater have a minimum of 48KB of shared memory per multiprocessor.
See https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications
If you are using mapped memory, the only limitation is how much memory the host machine has.
Which is faster to access?
In terms of raw numbers, shared memory is much faster (shared memory ~1.7TB/s, while global memory ~ XXXGB/s). However, in order to do anything you need to fill the shared memory with something, and you usually pull that from global memory. If the accesses to global memory are coalesced (non-random) and use a large word size, you can achieve speeds close to the theoretical limit of hundreds of GB/s, depending on the card and its memory interface.
Shared memory is useful when threads within a block need to reuse data that has already been pulled or computed from global memory. Instead of pulling from global memory again, you put the data in shared memory for other threads within the same block to see and reuse.
It is also commonly used as a scratchpad to reduce register pressure, which affects how many work groups (thread blocks) can run at the same time.
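The reuse pattern described above can be sketched with a per-block sum. This is only an illustration under assumed names (block_sum, run_block_sum, BLOCK): each thread pulls one value from global memory into shared memory once, and the block then combines those values without touching global memory again.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block; assumed to be a power of two

// Hypothetical kernel: one partial sum per block, computed in shared memory.
__global__ void block_sum(const int *in, int *out, int n) {
    __shared__ int tile[BLOCK];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0;  // single read from global memory
    __syncthreads();

    // Tree reduction: all further reads and writes hit shared memory only.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

// Hypothetical helper: sums n ones on the device; returns -1 on error.
int run_block_sum(int n) {
    int blocks = (n + BLOCK - 1) / BLOCK;
    std::vector<int> h_in(n, 1), h_out(blocks, 0);
    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, blocks * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    block_sum<<<blocks, BLOCK>>>(d_in, d_out, n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, blocks * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int total = 0;
    for (int v : h_out) total += v;  // combine the per-block partial sums
    return total;
}

int main() {
    printf("total = %d\n", run_block_sum(1000));
    return 0;
}
```

The __syncthreads() calls matter: without them a thread could read a tile entry before the thread responsible for it has written it.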
Is storing a variable in shared memory the same as passing its address via the kernel?
No. If you pass an address of anything from the host, it is always an address in global memory. You can't set shared memory from the host. Either you pass a constant, and the kernel sets the shared memory to that constant, or you pass an address in global memory, and the kernel pulls the data from there when needed.
The contents of global memory are visible to all the threads of the grid. Any thread can read and write to any location of the global memory.
Shared memory is separate for each block of the grid. Any thread of a block can read and write to the shared memory of that block. A thread in one block cannot access shared memory of another block.
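A small sketch of that per-block separation, with hypothetical names (tag_blocks, run_tag_blocks): every block declares the same __shared__ variable, but each block gets its own copy, so the value a thread reads depends only on what its own block wrote.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: thread 0 of each block writes to that block's shared
// variable, then every thread of the block reads it back.
__global__ void tag_blocks(int *out) {
    __shared__ int block_value;  // one independent copy per block
    if (threadIdx.x == 0) block_value = blockIdx.x * 100;
    __syncthreads();             // make the write visible to the whole block
    out[blockIdx.x * blockDim.x + threadIdx.x] = block_value;
}

// Hypothetical helper: returns 1 if every thread saw its own block's value,
// 0 if not, -1 on a CUDA error.
int run_tag_blocks(int blocks, int threads) {
    int n = blocks * threads;
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return -1;
    tag_blocks<<<blocks, threads>>>(d_out);
    std::vector<int> h_out(n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    for (int i = 0; i < n; ++i)
        if (h_out[i] != (i / threads) * 100) return 0;
    return 1;
}

int main() {
    printf("per-block values ok: %d\n", run_tag_blocks(4, 64));
    return 0;
}
```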
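cudaMalloc always allocates global memory.
Devices below compute capability 2.0 have 16 KB/Block; compute 2.0 onwards have 48 KB/Block shared memory by default.
Update:
Devices of Compute Capability 7.0 (Volta Architecture) allow allocating shared memory of up to 96 KB per block, provided the following conditions are satisfied: shared memory is allocated dynamically, and before launching the kernel the maximum size of dynamic shared memory is specified using cudaFuncSetAttribute as follows.
__global__ void MyKernel(...)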
{
    extern __shared__ float shMem[];
}
int bytes = 98304; //96 KB
cudaFuncSetAttribute(MyKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, bytes);
MyKernel<<<gridSize, blockSize, bytes>>>(...);
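For completeness, here is a self-contained variant of the snippet above that should compile as written. The kernel fill_shared and the helper run_large_shared are made-up names, and clamping the request to the value reported by cudaDevAttrMaxSharedMemoryPerBlockOptin is my own addition so the sketch also runs on devices that do not offer the full 96 KB.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: fills the dynamic shared buffer, reports its last element.
__global__ void fill_shared(float *out, int count) {
    extern __shared__ float shMem[];  // size is set by the third launch parameter
    for (int i = threadIdx.x; i < count; i += blockDim.x)
        shMem[i] = (float)i;
    __syncthreads();
    if (threadIdx.x == 0) *out = shMem[count - 1];
}

// Hypothetical helper: returns 1 on success, 0 on a wrong value, -1 on error.
int run_large_shared() {
    int max_optin = 0;
    if (cudaDeviceGetAttribute(&max_optin,
                               cudaDevAttrMaxSharedMemoryPerBlockOptin,
                               0) != cudaSuccess) return -1;

    int bytes = 98304;                          // 96 KB, as in the snippet above
    if (bytes > max_optin) bytes = max_optin;   // assumption: clamp to the device limit
    if (bytes < (int)sizeof(float)) return -1;

    // Opt in before the launch; without this, large requests fail at launch.
    if (cudaFuncSetAttribute(fill_shared,
                             cudaFuncAttributeMaxDynamicSharedMemorySize,
                             bytes) != cudaSuccess) return -1;

    int count = bytes / (int)sizeof(float);
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(float)) != cudaSuccess) return -1;
    fill_shared<<<1, 256, bytes>>>(d_out, count);

    float h_out = -1.0f;
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;
    return h_out == (float)(count - 1) ? 1 : 0;
}

int main() {
    printf("large shared memory ok: %d\n", run_large_shared());
    return 0;
}
```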
CUDA shared memory is memory shared between the threads within a block; it is not shared between the blocks of a grid, and each block sees only its own copy. It can be thought of as a manually managed cache (closer to an L1 than an L2 cache, since it lives on the multiprocessor itself).
Usually global memory resides on the device, but recent versions of CUDA (if the device supports it) can map host memory into the device address space, in which case accesses trigger an in-situ DMA transfer from host to device memory.
There's a size limit on shared memory, depending on the device. It's reported in the device capabilities, retrieved when enumerating CUDA devices. Global memory is limited by the total memory available to the GPU. For example, a GTX680 offers 48kiB of shared memory and 2GiB of device memory.
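If you want to read those limits for your own card, a minimal sketch using cudaGetDeviceProperties looks like this. The helper names are made up; the cudaDeviceProp fields are the ones the runtime reports.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: shared memory per block in bytes, or 0 on error.
size_t shared_mem_per_block(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 0;
    return prop.sharedMemPerBlock;
}

// Hypothetical helper: total global memory in bytes, or 0 on error.
size_t total_global_mem(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 0;
    return prop.totalGlobalMem;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no CUDA device found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        printf("device %d: %s (compute capability %d.%d)\n",
               dev, prop.name, prop.major, prop.minor);
        printf("  global memory:           %zu bytes\n", prop.totalGlobalMem);
        printf("  shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    }
    return 0;
}
```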
Shared memory is faster to access than global memory, but access patterns must be aligned carefully (for both shared and global memory) to be efficient. If you can't make your access patterns properly aligned, use textures (also global memory, but accessed through different circuitry and a different cache, which can deal better with unaligned access).
Is storing a variable in shared memory the same as passing its address via the kernel?
No, definitely not. The code you proposed would be a case where you'd use in-situ transferred global memory. Shared memory cannot be passed between kernels, as the contents of a shared block are defined only within an executing block of threads.