 

CUDA shared memory array variable

Tags:

c

cuda

I am trying to declare a variable for matrix multiplication as follows:

__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

I want the user to be able to input the size of the matrix to calculate, but that would mean changing BLOCK_SIZE. I changed it, but now I get a compiler error: "error: constant value is not known". I've looked into it and it's similar to this thread. So I tried:

__shared__ int buf [];

But then I get: "error: incomplete type is not allowed"

Thanks, Dan

Update with code (I pretty much followed this guide and the starting out with CUDA guide): the block size is obtained by asking the user for the size of the matrix. They enter x and y. The block size is only x, and right now it has to be the same size as x and y.

__global__ void matrixMul( float* C, float* A, float* B, int wA, int wB, size_t block_size)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed 
    // by the block
    int aBegin = wA * block_size * by;

    // Index of the last sub-matrix of A processed 
    // by the block
    int aEnd   = aBegin + wA - 1;

    // Step size used to iterate through the 
    // sub-matrices of A
    int aStep  = block_size;

    // Index of the first sub-matrix of B processed 
    // by the block
    int bBegin = block_size * bx;

    // Step size used to iterate through the 
    // sub-matrices of B
    int bStep  = block_size * wB;
    float Csub=0;
    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) 
    {
        // Declaration of the shared memory array As 
        // used to store the sub-matrix of A

        extern __shared__ float As[];

        // Declaration of the shared memory array Bs 
        // used to store the sub-matrix of B
        extern __shared__ float Bs[];
        extern __shared__ float smem[];

        // Load the matrices from global memory
        // to shared memory; each thread loads
        // one element of each matrix
        smem[ty*block_size+tx] = A[a + wA * ty + tx];
        //cuPrintf("\n\nWhat are the memory locations?\n");
        //cuPrintf("The shared memory(A) is: %.2f\n",smem[ty*block_size+tx]);
        smem[block_size*block_size+ty*block_size+tx]  = B[b + wB * ty + tx];
        //cuPrintf("The shared memory(B) is: %.2f\n",smem[block_size*block_size+ty*block_size+tx]);
        // Synchronize to make sure the matrices 
        // are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < block_size; ++k)
        {

            Csub += smem[ty*block_size+k] * smem[block_size*block_size+k*block_size+tx] ;
            //cuPrintf("Csub is currently: %.2f\n",Csub);
        }
        //cuPrintf("\n\n\n");
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        //cuPrintf("the results are csub: %.2f\n",Csub);
        __syncthreads();
    }
    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * block_size * by + block_size * bx;
    C[c + wB * ty + tx] = Csub;


}
Dan asked Feb 08 '12

People also ask

Can CUDA use shared memory?

Because CUDA shared memory is located on chip, its memory bandwidth is much larger than the global memory which is located off chip. Therefore, CUDA kernel optimization by caching memory access on shared memory can improve the performance of some operations significantly, especially for those memory-bound operations.

How is shared memory allocated?

Shared memory is allocated per thread block, so all threads in the block have access to the same shared memory. Threads can access data in shared memory loaded from global memory by other threads within the same thread block.
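A minimal sketch of that pattern (the kernel name and the tile size of 256 are illustrative, and blockDim.x is assumed to be 256):

__global__ void reverseEachBlock(float *data)
{
    __shared__ float tile[256];                    // one tile per thread block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];                   // each thread loads one element into shared memory
    __syncthreads();                               // wait until every thread in the block has loaded

    // any thread may now read elements that were loaded by other threads in the same block
    data[i] = tile[blockDim.x - 1 - threadIdx.x];
}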

How do you use constant memory in CUDA?

A variable allocated in constant memory needs to be declared in CUDA by using the special __constant__ identifier, and it must be a global variable, i.e. it must be declared in the scope that contains the kernel, not inside the kernel itself.
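A minimal sketch, where the coeffs table and the scale kernel are illustrative names:

__constant__ float coeffs[16];                     // declared at file scope, outside any kernel

__global__ void scale(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = coeffs[i % 16] * in[i];           // every thread reads the constant table
}

// host side: fill the constant symbol before launching the kernel
// cudaMemcpyToSymbol(coeffs, hostCoeffs, 16 * sizeof(float));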

What is global memory in GPU?

Global memory can be considered the main memory space of the GPU in CUDA. It is allocated, and managed, by the host, and it is accessible to both the host and the GPU, and for this reason the global memory space can be used to exchange data between the two.
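A minimal host-side sketch of that workflow (h_A and the elided kernel launch are illustrative):

float h_A[1024];                                   // host data, assumed to be filled in already
float *d_A = nullptr;
size_t bytes = sizeof(h_A);

cudaMalloc(&d_A, bytes);                           // allocate global memory on the device
cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);   // host -> device
// ... launch a kernel that reads and writes d_A in global memory ...
cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost);   // device -> host
cudaFree(d_A);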


2 Answers

extern __shared__ int buf[];

When you launch the kernel, you should launch it this way:

kernel<<<blocks,threads,numbytes_for_shared>>>(...);

If you have multiple extern declarations of shared memory:

extern __shared__ float As[];

extern __shared__ float Bs[];

this will lead to As pointing to the same address as Bs.

You will need to keep As and Bs inside a single 1D array:

extern __shared__ float smem[];

When calling the kernel, you should launch it with 2*BLOCK_SIZE*BLOCK_SIZE*sizeof(float) bytes of dynamic shared memory.

When indexing into As, use smem[y*BLOCK_SIZE+x], and when indexing into Bs, use smem[BLOCK_SIZE*BLOCK_SIZE+y*BLOCK_SIZE+x].
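For the kernel in the question, a sketch of the matching host-side launch might look like this (d_A, d_B, d_C, wA, wB and hA are assumed to be set up elsewhere; the block size of 16 is just an example):

size_t block_size = 16;                                   // whatever size the user entered
dim3 threads(block_size, block_size);
dim3 grid(wB / block_size, hA / block_size);              // assumes the dimensions divide evenly
size_t smem_bytes = 2 * block_size * block_size * sizeof(float);  // room for both As and Bs

matrixMul<<< grid, threads, smem_bytes >>>(d_C, d_A, d_B, wA, wB, block_size);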

brano answered Oct 19 '22


You have two choices for declaring shared memory inside a kernel - static or dynamic. I presume what you are doing at the moment looks something like this:

#define BLOCK_SIZE (16)

__global__ void sgemm0(const float *A, const float *B, float *C)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

}

and you would like to be able to easily change BLOCK_SIZE.

One possibility is to continue to use static shared memory allocation, but make the allocation size a template parameter, like this:

template<int blocksize=16>
__global__ void sgemm1(const float *A, const float *B, float *C)
{
    __shared__ float As[blocksize][blocksize];

}
template void sgemm1<16>(const float *, const float *, float *C);

Then you can instantiate as many different block size variants at compile time as you need.
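Because a template argument must be a compile-time constant, a block size chosen by the user at runtime has to be mapped onto one of the pre-instantiated variants, for example with a small host-side dispatcher. This is only a sketch; the supported sizes and the grid argument are illustrative:

void launch_sgemm1(const float *A, const float *B, float *C,
                   dim3 grid, int blocksize)
{
    switch (blocksize) {
    case 8:  sgemm1<8>  <<< grid, dim3(8, 8)   >>>(A, B, C); break;
    case 16: sgemm1<16> <<< grid, dim3(16, 16) >>>(A, B, C); break;
    case 32: sgemm1<32> <<< grid, dim3(32, 32) >>>(A, B, C); break;
    default: break;   // unsupported size: report an error or fall back
    }
}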

If you want to dynamically allocate the memory, define it like this:

__global__ void sgemm2(const float *A, const float *B, float *C)
{
    extern __shared__ float As[];

} 

and then add the size of the allocation as an argument to the kernel call:

size_t blocksize = BLOCK_SIZE * BLOCK_SIZE;
sgemm2<<< gridDim, blockDim, sizeof(float)*blocksize >>>(....);

If you have multiple statically declared arrays which you wish to replace with dynamically allocated shared memory, then be aware that there is only ever one dynamic shared memory allocation per kernel, so multiple items must exist within (share) that memory segment. So if you had something like:

#define BLOCK_SIZE (16)

__global__ void sgemm0(const float *A, const float *B, float *C)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

}

you could replace it with:

#define BLOCK_SIZE (16)

__global__ void sgemm3(const float *A, const float *B, float *C)
{
    extern __shared__ float buffer[];

    float *As = &buffer[0];
    float *Bs = &buffer[BLOCK_SIZE*BLOCK_SIZE];

}

and launch the kernel like this:

size_t blocksize = 2 * BLOCK_SIZE * BLOCK_SIZE;
sgemm3<<< gridDim, blockDim, sizeof(float)*blocksize >>>(....);

All are equally valid, although I personally favour the template version because it allows other compiler optimisations, like automatic loop unrolling, that the dynamic version cannot benefit from without extra work.
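For example, with the template version the trip count of the inner product loop is a compile-time constant, so the compiler can unroll it automatically or via a hint; a minimal sketch, where tile_dot is an illustrative helper rather than part of the original kernels:

template<int blocksize>
__device__ float tile_dot(const float *As, const float *Bs, int tx, int ty)
{
    float sum = 0.0f;
#pragma unroll                                    // possible because blocksize is known at compile time
    for (int k = 0; k < blocksize; ++k)
        sum += As[ty * blocksize + k] * Bs[k * blocksize + tx];
    return sum;
}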

talonmies answered Oct 19 '22