
Dynamic Shared Memory in CUDA

Tags: cuda

There are similar questions to the one I'm about to ask, but I feel like none of them get at the heart of what I'm really looking for. What I have is a CUDA kernel that requires two arrays in shared memory. The size of the arrays is given by a variable that is read into the program after execution has started, so I cannot use that variable to size the arrays: the size of a statically declared shared array must be known at compile time. I do not want to do something like __shared__ double arr1[1000], because typing in the size by hand is useless to me when it will change depending on the input. For the same reason, I cannot use #define to create a constant for the size.

I can follow an example similar to the one in the programming guide (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared), such as:

extern __shared__ float array[];
__device__ void func()      // __device__ or __global__ function
{
    short* array0 = (short*)array; 
    float* array1 = (float*)&array0[128];
    int*   array2 =   (int*)&array1[64];
}

But this still runs into an issue. From what I've read, an extern shared array declaration always points to the start of the dynamically allocated region. That means my second array must be offset by the size of the first array, as they appear to do in this example, but the size of the first array depends on user input.

Another question (Cuda Shared Memory array variable) has a similar issue, and the advice there was to create a single array that serves as backing storage for both arrays, adjusting the indices so they map correctly onto each one. While this does seem to do what I want, it looks very messy. Is there any way around this, so that I can still maintain two independent arrays, each with a size defined by user input?

zephyr asked Jul 24 '14 19:07


People also ask

Can CUDA use shared memory?

Shared memory is a CUDA memory space that is shared by all threads in a thread block. In this case, shared means that all threads in a thread block can read from and write to the block-allocated shared memory, and all changes to this memory will eventually become visible to all threads in the block.
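For example, here is a minimal sketch (added for illustration, assuming a single-block launch with 256 threads and a 256-element input):

// reverse a 256-element array within one block using shared memory
__global__ void reverse_in_block(int *data)
{
    __shared__ int buf[256];   // statically sized; visible to every thread in the block

    int t = threadIdx.x;
    buf[t] = data[t];          // each thread writes its own slot
    __syncthreads();           // barrier: all writes are now visible block-wide
    data[t] = buf[255 - t];    // each thread reads a slot written by another thread
}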

Where is CUDA shared memory?

Because CUDA shared memory is located on chip, its memory bandwidth is much larger than that of global memory, which is located off chip. Therefore, optimizing a CUDA kernel by caching memory accesses in shared memory can improve the performance of some operations significantly, especially memory-bound ones.

What memory system is used in CUDA?

CUDA also uses an abstract memory type called local memory. Local memory is not a separate memory system per se but rather a memory location used to hold spilled registers. Register spilling occurs when a thread block requires more register storage than is available on an SM.

How does CUDA allocate memory?

Memory management on a CUDA device is similar to how it is done in CPU programming. You need to allocate memory space on the device, transfer the data to the device using the built-in API, retrieve the data (transfer it back to the host), and finally free the allocated memory.
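A minimal host-side sketch of that sequence (illustrative only; error checking omitted):

#include <cuda_runtime.h>

int main(){
  const int n = 256;
  const size_t bytes = n * sizeof(float);

  float h_data[n];                                             // host buffer
  for (int i = 0; i < n; i++) h_data[i] = (float)i;

  float *d_data = NULL;
  cudaMalloc((void **)&d_data, bytes);                         // allocate on the device
  cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // transfer to the device
  // ... launch kernels that operate on d_data here ...
  cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // retrieve the results
  cudaFree(d_data);                                            // free the device allocation
  return 0;
}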


1 Answer

When using dynamic shared memory with CUDA, there is one and only one pointer passed to the kernel, which defines the start of the requested/allocated area in bytes:

extern __shared__ char array[];

There is no way to handle it differently. However, this does not prevent you from having two user-sized arrays. Here's a worked example:

$ cat t501.cu
#include <stdio.h>

__global__ void my_kernel(unsigned arr1_sz, unsigned arr2_sz){

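  // all dynamic shared memory arrives as a single allocation; both arrays are carved out of it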
  extern __shared__ char array[];

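  // arr1: arr1_sz doubles at the base of the allocation; arr2: arr2_sz chars immediately after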
  double *my_ddata = (double *)array;
  char *my_cdata = arr1_sz*sizeof(double) + array;

  for (int i = 0; i < arr1_sz; i++) my_ddata[i] = (double) i*1.1f;
  for (int i = 0; i < arr2_sz; i++) my_cdata[i] = (char) i;

  printf("at offset %d, arr1: %lf, arr2: %d\n", 10, my_ddata[10], (int)my_cdata[10]);
}

int main(){
  unsigned double_array_size = 256;
  unsigned char_array_size = 128;
  unsigned shared_mem_size = (double_array_size*sizeof(double)) + (char_array_size*sizeof(char));
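  // the third launch-configuration parameter requests the dynamic shared memory, in bytes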
  my_kernel<<<1,1, shared_mem_size>>>(256, 128);
  cudaDeviceSynchronize();
  return 0;
}


$ nvcc -arch=sm_20 -o t501 t501.cu
$ cuda-memcheck ./t501
========= CUDA-MEMCHECK
at offset 10, arr1: 11.000000, arr2: 10
========= ERROR SUMMARY: 0 errors
$

If you have a random arrangement of arrays of mixed data types, you'll want to either manually align your array starting points, use alignment directives, or use structures to help with alignment; in each case, be sure to request enough shared memory. A sketch of the manual approach follows.
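Here is one way the manual alignment might look (a sketch added for illustration; align_up is a hypothetical helper, not part of the original answer):

// hypothetical helper: round a byte offset up to the next multiple of a power-of-two alignment
__host__ __device__ inline size_t align_up(size_t offset, size_t alignment){
  return (offset + alignment - 1) & ~(alignment - 1);
}

__global__ void mixed_kernel(unsigned nchars, unsigned ndoubles){
  extern __shared__ char array[];

  char *my_cdata = array;                                // chars first: no alignment concern at the base

  size_t offset = align_up(nchars * sizeof(char), 8);    // doubles need 8-byte alignment
  double *my_ddata = (double *)(array + offset);         // start the doubles at the padded offset

  if (threadIdx.x == 0){                                 // trivial accesses, just to show the layout in use
    my_cdata[0] = 0;
    my_ddata[0] = 0.0;
  }
}

The host launch would then request align_up(nchars, 8) + ndoubles*sizeof(double) bytes of dynamic shared memory, so that the padding between the two arrays is accounted for.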

Robert Crovella answered Oct 02 '22 15:10