I am struck up with Matrix multiplication on CUDA. The resultant product matrix is always zero. I have read some sample codes like matrix multiplication in cuda for resolving my problem, but all in vain.
Apart from erratic result of 0, the maximum size of "Width" (code below) is not even 512. I was not able to debug where the problem lies. May be we can discuss it on StackOverflow.
I am referring "Programming Massively Parallel Processors"
#include<cuda.h>
#include<stdio.h>
int main(void) {
void MatrixMultiplication(float *, float *, float *, int);
const int Width = 5;
float M[Width*Width], N[Width*Width], P[Width*Width];
for(int i = 0; i < (Width*Width) ; i++) {
M[i] = 5;
N[i] = 5;
P[i] = 0;
}
MatrixMultiplication(M, N, P, Width);
for(int i = 0; i < (Width*Width) ; i++) {
printf("%d \n", P[i]);
}
int quit;
scanf("%d",&quit);
return 0;
}
//Matrix multiplication kernel - thread specification
__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width) {
//2D Thread ID
int tx = threadIdx.x;
int ty = threadIdx.y;
//Pvalue stores the Pd element that is computed by the thread
float Pvalue = 0;
for(int k = 0; k < Width ; ++k) {
float Mdelement = Md[ty*Width + k];
float Ndelement = Nd[k*Width + tx];
Pvalue += (Mdelement*Ndelement);
}
Pd[ty*Width + tx] = Pvalue;
}
void MatrixMultiplication(float *M, float *N, float *P, int Width) {
int size = Width*Width*sizeof(float);
float *Md, *Nd, *Pd;
//Transfer M and N to device memory
cudaMalloc((void**)&Md, size);
cudaMemcpy(Md,M,size,cudaMemcpyHostToDevice);
cudaMalloc((void**)&Nd, size);
cudaMemcpy(Nd,N,size,cudaMemcpyHostToDevice);
//Allocate P on the device
cudaMalloc((void**)&Pd,size);
//Setup the execution configuration
dim3 dimBlock(Width,Width);
dim3 dimGrid(1,1);
//Launch the device computation threads!
MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);
//Transfer P from device to host
cudaMemcpy(P,Pd,size,cudaMemcpyDeviceToHost);
//Free device matrices
cudaFree(Md);
cudaFree(Nd);
cudaFree(Pd);
}
In your case of matrix multiplication. You can parallelize the computations, Because GPU have much more threads and in each thread you have multiple blocks. So a lot of computations are parallelized, resulting quick computations.
For matrices of size 800×800, the GPU implementation is more than 3.5 times faster than the CPU one. Furthermore, the usage of the shared memory in the GPU implementation further speeds up the execution dramatically. On our input data this fast GPU implementation was up to 7.5 times faster than the CPU implementation.
In linear algebra, the Strassen algorithm, named after Volker Strassen, is an algorithm for matrix multiplication. It is faster than the standard matrix multiplication algorithm for large matrices, with a better asymptotic complexity, although the naive algorithm is often better for smaller matrices.
You were doing fine until this point:
for(int i = 0; i < (Width*Width) ; i++) {
printf("%d \n", P[i]);
}
I changed it to %f (because it's a float) and they all print nicely :)
$ ./test.exe
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
I figured out what was wrong. Let's analyze it :
Point 1 : The quest to remove the ever monotonic "zero value"
As noted, you must replace printf("%d \n", P[i]);
as printf("%f \n", P[i]);
Point 2 : Why the program fails with a value of Width 512 ?
Actually it will fail for even a small value such as 23. Why ? Because 23*23 is > 512 (The maximum number of threads that a GPU can have per block as of today!)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With