 

CUDA shared memory array variable

Tags:

c

cuda

I am trying to declare a variable for matrix multiplication as follows:

__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

I want the user to be able to input the size of the matrix to calculate, but that would mean changing BLOCK_SIZE. I changed it, but now I get a compiler error: "error: constant value is not known". I've looked into it and it's similar to this thread. So I tried:

__shared__ int buf [];

But then I get: "error: incomplete type is not allowed"

Thanks, Dan

Update with code (I pretty much followed this guide and the starting out with CUDA guide): the block size is obtained by asking the user for the size of the matrix. They enter x and y. The block size is only x, and right now it has to be the same size as x and y.

__global__ void matrixMul( float* C, float* A, float* B, int wA, int wB, size_t block_size)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed 
    // by the block
    int aBegin = wA * block_size * by;

    // Index of the last sub-matrix of A processed 
    // by the block
    int aEnd   = aBegin + wA - 1;

    // Step size used to iterate through the 
    // sub-matrices of A
    int aStep  = block_size;

    // Index of the first sub-matrix of B processed 
    // by the block
    int bBegin = block_size * bx;

    // Step size used to iterate through the 
    // sub-matrices of B
    int bStep  = block_size * wB;
    float Csub=0;
    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) 
    {
        // Declaration of the shared memory array As 
        // used to store the sub-matrix of A

        extern __shared__ float As[];

        // Declaration of the shared memory array Bs 
        // used to store the sub-matrix of B
        extern __shared__ float Bs[];
        extern __shared__ float smem[];

        // Load the matrices from global memory
        // to shared memory; each thread loads
        // one element of each matrix
        smem[ty*block_size+tx] = A[a + wA * ty + tx];
        //cuPrintf("\n\nWhat are the memory locations?\n");
        //cuPrintf("The shared memory(A) is: %.2f\n",smem[ty*block_size+tx]);
        smem[block_size*block_size+ty*block_size+tx]  = B[b + wB * ty + tx];
        //cuPrintf("The shared memory(B) is: %.2f\n",smem[block_size*block_size+ty*block_size+tx]);
        // Synchronize to make sure the matrices 
        // are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < block_size; ++k)
        {

            Csub += smem[ty*block_size+k] * smem[block_size*block_size+k*block_size+tx] ;
            //cuPrintf("Csub is currently: %.2f\n",Csub);
        }
        //cuPrintf("\n\n\n");
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        //cuPrintf("the results are csub: %.2f\n",Csub);
        __syncthreads();
    }
    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * block_size * by + block_size * bx;
    C[c + wB * ty + tx] = Csub;


}
Dan asked Feb 08 '12

People also ask

Can CUDA use shared memory?

Because CUDA shared memory is located on chip, its memory bandwidth is much larger than the global memory which is located off chip. Therefore, CUDA kernel optimization by caching memory access on shared memory can improve the performance of some operations significantly, especially for those memory-bound operations.

How is shared memory allocated?

Shared memory is allocated per thread block, so all threads in the block have access to the same shared memory. Threads can access data in shared memory loaded from global memory by other threads within the same thread block.
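A minimal sketch of that pattern (the kernel name and the tile size of 256 are illustrative, and blockDim.x is assumed to be 256):

__global__ void reverseEachBlock(float *data)
{
    __shared__ float tile[256];                    // one tile per thread block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];                   // each thread loads one element into shared memory
    __syncthreads();                               // wait until every thread in the block has loaded

    // any thread may now read elements that were loaded by other threads in the same block
    data[i] = tile[blockDim.x - 1 - threadIdx.x];
}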

How do you use constant memory in CUDA?

A variable allocated in constant memory needs to be declared in CUDA by using the special __constant__ identifier, and it must be a global variable, i.e. it must be declared in the scope that contains the kernel, not inside the kernel itself.
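A minimal sketch, where the coeffs table and the scale kernel are illustrative names:

__constant__ float coeffs[16];                     // declared at file scope, outside any kernel

__global__ void scale(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = coeffs[i % 16] * in[i];           // every thread reads the constant table
}

// host side: fill the constant symbol before launching the kernel
// cudaMemcpyToSymbol(coeffs, hostCoeffs, 16 * sizeof(float));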

What is global memory in GPU?

Global memory can be considered the main memory space of the GPU in CUDA. It is allocated, and managed, by the host, and it is accessible to both the host and the GPU, and for this reason the global memory space can be used to exchange data between the two.
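A minimal host-side sketch of that workflow (h_A and the elided kernel launch are illustrative):

float h_A[1024];                                   // host data, assumed to be filled in already
float *d_A = nullptr;
size_t bytes = sizeof(h_A);

cudaMalloc(&d_A, bytes);                           // allocate global memory on the device
cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);   // host -> device
// ... launch a kernel that reads and writes d_A in global memory ...
cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost);   // device -> host
cudaFree(d_A);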


2 Answers

extern __shared__ int buf[];

When you launch the kernel, you should launch it this way:

kernel<<<blocks,threads,numbytes_for_shared>>>(...);

If you have multiple extern declarations of shared memory:

extern __shared__ float As[];

extern __shared__ float Bs[];

this will lead to As pointing to the same address as Bs.

You will need to keep As and Bs inside a single 1D array:

extern __shared__ float smem[];

When calling the kernel, you should launch it with 2*BLOCK_SIZE*BLOCK_SIZE*sizeof(float) bytes of dynamic shared memory.

When indexing into As, use smem[y*BLOCK_SIZE+x], and when indexing into Bs, use smem[BLOCK_SIZE*BLOCK_SIZE+y*BLOCK_SIZE+x].
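For the kernel in the question, a sketch of the matching host-side launch might look like this (d_A, d_B, d_C, wA, wB and hA are assumed to be set up elsewhere; the block size of 16 is just an example):

size_t block_size = 16;                                   // whatever size the user entered
dim3 threads(block_size, block_size);
dim3 grid(wB / block_size, hA / block_size);              // assumes the dimensions divide evenly
size_t smem_bytes = 2 * block_size * block_size * sizeof(float);  // room for both As and Bs

matrixMul<<< grid, threads, smem_bytes >>>(d_C, d_A, d_B, wA, wB, block_size);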

brano answered Oct 19 '22


You have two choices for declaring shared memory inside a kernel - static or dynamic. I presume what you are doing at the moment looks something like this:

#define BLOCK_SIZE (16)

__global__ void sgemm0(const float *A, const float *B, float *C)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

}

and you would like to be able to easily change BLOCK_SIZE.

One possibility is to continue to use static shared memory allocation, but make the allocation size a template parameter, like this:

template<int blocksize=16>
__global__ void sgemm1(const float *A, const float *B, float *C)
{
    __shared__ float As[blocksize][blocksize];

}
template void sgemm1<16>(const float *, const float *, float *C);

Then you can instantiate as many different block size variants at compile time as you need.
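Because a template argument must be a compile-time constant, a block size chosen by the user at runtime has to be mapped onto one of the pre-instantiated variants, for example with a small host-side dispatcher. This is only a sketch; the supported sizes and the grid argument are illustrative:

void launch_sgemm1(const float *A, const float *B, float *C,
                   dim3 grid, int blocksize)
{
    switch (blocksize) {
    case 8:  sgemm1<8>  <<< grid, dim3(8, 8)   >>>(A, B, C); break;
    case 16: sgemm1<16> <<< grid, dim3(16, 16) >>>(A, B, C); break;
    case 32: sgemm1<32> <<< grid, dim3(32, 32) >>>(A, B, C); break;
    default: break;   // unsupported size: report an error or fall back
    }
}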

If you want to dynamically allocate the memory, define it like this:

__global__ void sgemm2(const float *A, const float *B, float *C)
{
    extern __shared__ float As[];

} 

and then add the size of the allocation as an argument to the kernel call:

size_t blocksize = BLOCK_SIZE * BLOCK_SIZE;
sgemm2<<< gridDim, blockDim, sizeof(float)*blocksize >>>(....);

If you have multiple statically declared arrays which you wish to replace with dynamically allocated shared memory, then be aware that there is only ever one dynamic shared memory allocation per kernel, so multiple items must exist within (share) that memory segment. So if you had something like:

#define BLOCK_SIZE (16)

__global__ void sgemm0(const float *A, const float *B, float *C)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

}

you could replace it with:

#define BLOCK_SIZE (16)

__global__ void sgemm3(const float *A, const float *B, float *C)
{
    extern __shared__ float buffer[];

    float *As = &buffer[0];
    float *Bs = &buffer[BLOCK_SIZE*BLOCK_SIZE];

}

and launch the kernel like this:

size_t blocksize = 2 * BLOCK_SIZE * BLOCK_SIZE;
sgemm3<<< gridDim, blockDim, sizeof(float)*blocksize >>>(....);

All are equally valid, although I personally favour the template version because it allows other compiler optimisations, like automatic loop unrolling, that the dynamic version cannot benefit from without extra work.
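For example, with the template version the trip count of the inner product loop is a compile-time constant, so the compiler can unroll it automatically or via a hint; a minimal sketch, where tile_dot is an illustrative helper rather than part of the original kernels:

template<int blocksize>
__device__ float tile_dot(const float *As, const float *Bs, int tx, int ty)
{
    float sum = 0.0f;
#pragma unroll                                    // possible because blocksize is known at compile time
    for (int k = 0; k < blocksize; ++k)
        sum += As[ty * blocksize + k] * Bs[k * blocksize + tx];
    return sum;
}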

talonmies answered Oct 19 '22