Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

CUDA: Tiled matrix-matrix multiplication with shared memory and matrix size which is non-multiple of the block size

Tags:

c

matrix

cuda

I'm trying to familiarize myself with CUDA programming, and having a pretty fun time of it. I'm currently looking at this pdf which deals with matrix multiplication, done with and without shared memory. Full code for both versions can be found here. This code is almost the exact same as what's in the CUDA matrix multiplication samples. Although the non-shared memory version has the capability to run at any matrix size, regardless of block size, the shared memory version must work with matrices that are a multiple of the block size (which I set to 4, default was originally 16).

One of the problems suggested at the end of the pdf is to change it so that the shared memory version can also work with non-multiples of the block size. I thought this would be a simple index check, like in the non-shared version:

int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if(row > A.height || col > B.width) return;

But this doesn't work. Here's the full code, minus the main method (a bit of a mess, sorry), which has been modified somewhat by me:

void MatMul(const Matrix A, const Matrix B, Matrix C) { 
  // Load A and B to device memory 
  Matrix d_A; 
  d_A.width = d_A.stride = A.width; 
  d_A.height = A.height; 
  size_t size = A.width * A.height * sizeof(float); 
  cudaError_t err = cudaMalloc(&d_A.elements, size); 
  printf("CUDA malloc A: %s\n",cudaGetErrorString(err)); 
  err = cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice); 
  printf("Copy A to device: %s\n",cudaGetErrorString(err)); 

  Matrix d_B; 
  d_B.width = d_B.stride = B.width; 
  d_B.height = B.height; 
  size = B.width * B.height * sizeof(float); 
  err = cudaMalloc(&d_B.elements, size); 
  printf("CUDA malloc B: %s\n",cudaGetErrorString(err));
  err = cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);
  printf("Copy B to device: %s\n",cudaGetErrorString(err)); 

  Matrix d_C; 
  d_C.width = d_C.stride = C.width; 
  d_C.height = C.height; 
  size = C.width * C.height * sizeof(float); 
  err = cudaMalloc(&d_C.elements, size); 
  printf("CUDA malloc C: %s\n",cudaGetErrorString(err));

  dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE); 
    dim3 dimGrid((B.width + dimBlock.x - 1) / dimBlock.x, (A.height + dimBlock.y-1) / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C); 
    err = cudaThreadSynchronize();
    printf("Run kernel: %s\n", cudaGetErrorString(err));

  // Read C from device memory 
  err = cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost); 
  printf("Copy C off of device: %s\n",cudaGetErrorString(err));

  // Free device memory
  cudaFree(d_A.elements); 
  cudaFree(d_B.elements); 
  cudaFree(d_C.elements); 
} 

// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col) { 
  return A.elements[row * A.stride + col]; 
} 

// Set a matrix element 
__device__ void SetElement(Matrix A, int row, int col, float value) { 
  A.elements[row * A.stride + col] = value; 
} 

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is 
// located col sub-matrices to the right and row sub-matrices down 
// from the upper-left corner of A 
__device__ Matrix GetSubMatrix(Matrix A, int row, int col) { 
  Matrix Asub; 
  Asub.width = BLOCK_SIZE; 
  Asub.height = BLOCK_SIZE; 
  Asub.stride = A.stride; 
  Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row + BLOCK_SIZE * col]; 
  return Asub; 
}


// Matrix multiplication kernel called by MatMul() 
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C) { 
  // Block row and column 
  int blockRow = blockIdx.y; 
  int blockCol = blockIdx.x; 

  int rowTest = blockIdx.y * blockDim.y + threadIdx.y;
  int colTest = blockIdx.x * blockDim.x + threadIdx.x;
  if (rowTest>A.height || colTest>B.width)
    return;
  // Each thread block computes one sub-matrix Csub of C
  Matrix Csub = GetSubMatrix(C, blockRow, blockCol); 

  // Each thread computes one element of Csub 
  // by accumulating results into Cvalue 
  float Cvalue = 0.0; 
  // Thread row and column within Csub 
  int row = threadIdx.y; 
  int col = threadIdx.x; 
  // Loop over all the sub-matrices of A and B that are 
  // required to compute Csub 
  // Multiply each pair of sub-matrices together 
  // and accumulate the results 
  for (int m = 0; m < (BLOCK_SIZE + A.width - 1)/BLOCK_SIZE; ++m) {
    // Get sub-matrix Asub of A 
    Matrix Asub = GetSubMatrix(A, blockRow, m); 

    // Get sub-matrix Bsub of B 
    Matrix Bsub = GetSubMatrix(B, m, blockCol); 

    // Shared memory used to store Asub and Bsub respectively 
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; 
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE]; 

    // Load Asub and Bsub from device memory to shared memory 
    // Each thread loads one element of each sub-matrix 
    As[row][col] = GetElement(Asub, row, col); 
    Bs[row][col] = GetElement(Bsub, row, col); 

    // Synchronize to make sure the sub-matrices are loaded 
    // before starting the computation 
    __syncthreads(); 

    // Multiply Asub and Bsub together 
    for (int e = 0; e < BLOCK_SIZE; ++e) 
    {
      Cvalue += As[row][e] * Bs[e][col];
    }
    // Synchronize to make sure that the preceding 
    // computation is done before loading two new 
    // sub-matrices of A and B in the next iteration 
    __syncthreads();  
  }
  // Write Csub to device memory 
  // Each thread writes one element 
  SetElement(Csub, row, col, Cvalue); 
}

notable things which I changed: I added a check in MatMulKernel that checks if our current thread is trying to work on a spot in C that doesn't exist. This doesn't seem to work. Although it does change the result, the changes don't seem to have any pattern other than that later (higher x or y value) entries seem to be more affected (and I get a lot more non-integer results). I also changed the given dimGrid calculation method and the loop condition for m in MatMulKernel(before it was just width or height divided by block size, which seemed wrong).

Even the solutions guide that I found for this guide seems to suggest it should just be a simple index check, so I think I'm missing something really fundamental.

like image 924
Mike Avatar asked Sep 15 '13 17:09

Mike


People also ask

What is tiled matrix multiplication?

This is an algorithm performed on GPUs due to the parallel nature of matrix multiplication. We will especially look at a method called "tiling," which is used to reduce global memory accesses by taking advantage of the shared memory on the GPU. Tiling can be seen as a way to boost execution efficiency of the kernel.

What is shared memory in Cuda?

Shared memory is a powerful feature for writing well optimized CUDA code. Access to shared memory is much faster than global memory access because it is located on chip. Because shared memory is shared by threads in a thread block, it provides a mechanism for threads to cooperate.

What happens in matrix multiplication in CUDA?

Fig. 1: What happens in matrix multiplication? Obvious way to implement our parallel matrix multiplication in CUDA is to let each thread do a vector-vector multiplication i.e. each element in C matrix will be calculated by a separate CUDA thread. In CUDA programming model threads are organized into thread-blocks and grids.

How to specify the block and grid dimension using CUDA?

Since the output matrix is p × q, we need to have at least ⌈p/32⌉ number of thread-blocks in y -dimension and ⌈q/32⌉ number of thread-blocks in x -dimension (Fig. 2). So block and grid dimension can be specified as follows using CUDA. Here I assumed that columns in the matrix are indexed in x -dimension and rows in y -dimension.

What is the difference between CUDA shared memory and non-shareable memory?

This code is almost the exact same as what's in the CUDA matrix multiplication samples. Although the non-shared memory version has the capability to run at any matrix size, regardless of block size, the shared memory version must work with matrices that are a multiple of the block size (which I set to 4, default was originally 16).

How to perform matrix multiplication using parallel tiled approach?

Optimized Parallel Tiled Approach to perform Matrix Multiplication by taking advantage of the lower latency, higher bandwidth shared memory within GPU thread blocks. Run "make" to build the executable of this file. For debugging, run "make dbg=1" to build a debuggable version of the executable binary.


1 Answers

When the matrix dimensions are not multiples of the tile dimensions, then it can happen that some tiles cover the matrices only partially. The tile elements falling outside the not-fully overlapping tiles should be properly zero-ed. So, extending your code to arbitrarly sized matrices is easy, but does not amount at a simple index check. Below, I'm copying and pasting my version of the tiled matrix-matrix multiplication kernel with arbitrarily sized matrices

__global__ void MatMul(float* A, float* B, float* C, int ARows, int ACols, int BRows,
    int BCols, int CRows, int CCols)
{
    float CValue = 0;

    int Row = blockIdx.y*TILE_DIM + threadIdx.y;
    int Col = blockIdx.x*TILE_DIM + threadIdx.x;

    __shared__ float As[TILE_DIM][TILE_DIM];
    __shared__ float Bs[TILE_DIM][TILE_DIM];

    for (int k = 0; k < (TILE_DIM + ACols - 1)/TILE_DIM; k++) {

         if (k*TILE_DIM + threadIdx.x < ACols && Row < ARows)
             As[threadIdx.y][threadIdx.x] = A[Row*ACols + k*TILE_DIM + threadIdx.x];
         else
             As[threadIdx.y][threadIdx.x] = 0.0;

         if (k*TILE_DIM + threadIdx.y < BRows && Col < BCols)
             Bs[threadIdx.y][threadIdx.x] = B[(k*TILE_DIM + threadIdx.y)*BCols + Col];
         else
             Bs[threadIdx.y][threadIdx.x] = 0.0;

         __syncthreads();

         for (int n = 0; n < TILE_DIM; ++n)
             CValue += As[threadIdx.y][n] * Bs[n][threadIdx.x];

         __syncthreads();
    }

    if (Row < CRows && Col < CCols)
        C[((blockIdx.y * blockDim.y + threadIdx.y)*CCols) +
           (blockIdx.x * blockDim.x)+ threadIdx.x] = CValue;
}
like image 90
Vitality Avatar answered Nov 12 '22 06:11

Vitality