Matrix Multiplication using CUDA

I am stuck on matrix multiplication in CUDA. The resulting product matrix is always zero. I have read sample code, such as other CUDA matrix multiplication questions, to resolve my problem, but to no avail.

Apart from the erroneous result of 0, the maximum usable value of "Width" (code below) is less than 512. I have not been able to find where the problem lies. Maybe we can discuss it on Stack Overflow.

I am working from the book "Programming Massively Parallel Processors".

#include<cuda.h>
#include<stdio.h>

int main(void) {
    void MatrixMultiplication(float *, float *, float *, int);
    const int Width = 5;
    float M[Width*Width], N[Width*Width], P[Width*Width];
    for(int i = 0; i < (Width*Width) ; i++) {
        M[i] = 5;
        N[i] = 5;
        P[i] = 0;
    }
    MatrixMultiplication(M, N, P, Width);
    for(int i = 0; i < (Width*Width) ; i++) {
        printf("%d \n", P[i]);
    }
    int quit;
    scanf("%d",&quit);
    return 0;
}

//Matrix multiplication kernel - thread specification
__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width) {
    //2D Thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    //Pvalue stores the Pd element that is computed by the thread
    float Pvalue = 0;

    for(int k = 0; k < Width ; ++k) {
        float Mdelement = Md[ty*Width + k];
        float Ndelement = Nd[k*Width + tx];
        Pvalue += (Mdelement*Ndelement);
    }

    Pd[ty*Width + tx] = Pvalue;
}

void MatrixMultiplication(float *M, float *N, float *P, int Width) {
    int size = Width*Width*sizeof(float);
    float *Md, *Nd, *Pd;

    //Transfer M and N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md,M,size,cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd,N,size,cudaMemcpyHostToDevice);

    //Allocate P on the device
    cudaMalloc((void**)&Pd,size);

    //Setup the execution configuration
    dim3 dimBlock(Width,Width);
    dim3 dimGrid(1,1);

    //Launch the device computation threads!
    MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);

    //Transfer P from device to host
    cudaMemcpy(P,Pd,size,cudaMemcpyDeviceToHost);

    //Free device matrices
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
}

asked Feb 16 '11 20:02

Gaurav Kalra

2 Answers

You were doing fine until this point:

for(int i = 0; i < (Width*Width) ; i++) {
    printf("%d \n", P[i]);
}

I changed %d to %f (P[i] is a float, which printf receives as a double, so %d is the wrong specifier) and the values all print correctly:

$ ./test.exe
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
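
The effect of the specifier can be checked on the host without a GPU. The following is a minimal sketch in plain C; the helper name format_product is hypothetical and not part of the original program. It formats one element with %f, the specifier that matches the double a float is promoted to when it is passed to a printf-family function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not in the original program): writes one matrix
   element into buf using %f. A float passed through "..." is promoted to
   double, and %f is the specifier that expects a double. Returns the
   number of characters snprintf would write, excluding the terminator. */
int format_product(char *buf, size_t n, float value) {
    assert(buf != NULL && n > 0);
    return snprintf(buf, n, "%f", value);
}
```

Calling format_product(buf, sizeof buf, 125.0f) with a 32-byte buffer leaves the text 125.000000 in buf, the same form as the lines printed above.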

answered Sep 28 '22 03:09

ardiyu07

I figured out what was wrong. Let's analyze it:

Point 1: Getting rid of the constant zero output

As noted, you must replace printf("%d \n", P[i]); with printf("%f \n", P[i]);
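
Once the values print, it may be worth confirming that they are correct and not merely non-zero. A minimal host-side sketch in plain C, assuming the same row-major layout the kernel uses; matmul_reference is a hypothetical helper name, not something from the original program:

```c
#include <assert.h>

/* Hypothetical CPU reference (not in the original program).
   Computes P = M * N for square, row-major width x width matrices,
   with the same indexing as MatrixMulKernel. */
void matmul_reference(const float *M, const float *N, float *P, int width) {
    assert(M && N && P && width > 0);
    for (int row = 0; row < width; ++row) {
        for (int col = 0; col < width; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k) {
                sum += M[row * width + k] * N[k * width + col];
            }
            P[row * width + col] = sum;
        }
    }
}
```

With Width = 5 and every input element set to 5, each entry is a sum of five products of 5 * 5, that is 125, which can be compared element by element against the array copied back from the device.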

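Point 2: Why does the program fail with a Width of 512?

Actually, it fails even for a value as small as 23. Why? The kernel is launched as a single block of Width*Width threads, and 23*23 = 529 is greater than 512, the maximum number of threads per block on the GPUs I have today. (512 is the limit for compute capability 1.x devices; compute capability 2.0 and later allow 1024 threads per block, which still caps a single-block launch at Width = 32.)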
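
The arithmetic behind this limit, and the usual way around it (a grid of fixed-size blocks in place of one Width x Width block), can be sketched on the host in plain C. Everything here is illustrative: the helper names and the tile size of 16 are assumptions, and 512 is simply the figure quoted above. In a real kernel the loops over by/bx and ty/tx disappear, and the global row and column are computed as blockIdx.y * blockDim.y + threadIdx.y and blockIdx.x * blockDim.x + threadIdx.x, guarded by the same bounds check.

```c
#include <assert.h>

/* Hypothetical helpers (not in the original program). */

/* Nonzero if one width x width block stays within the per-block limit. */
int fits_in_one_block(int width, int max_threads_per_block) {
    assert(width > 0 && max_threads_per_block > 0);
    return (long)width * width <= (long)max_threads_per_block;
}

/* Blocks needed along one dimension for a given tile size
   (ceiling division, so a partial tile still gets a block). */
int blocks_per_dim(int width, int tile) {
    assert(width > 0 && tile > 0);
    return (width + tile - 1) / tile;
}

/* Host-side emulation of a multi-block launch. The by/bx loops stand in
   for blockIdx, the ty/tx loops for threadIdx, and tile for blockDim. */
void matmul_emulated_grid(const float *M, const float *N, float *P,
                          int width, int tile) {
    int grid = blocks_per_dim(width, tile);
    for (int by = 0; by < grid; ++by) {
        for (int bx = 0; bx < grid; ++bx) {
            for (int ty = 0; ty < tile; ++ty) {
                for (int tx = 0; tx < tile; ++tx) {
                    int row = by * tile + ty;  /* global row index */
                    int col = bx * tile + tx;  /* global column index */
                    if (row >= width || col >= width) {
                        continue;  /* "thread" falls outside the matrix */
                    }
                    float sum = 0.0f;
                    for (int k = 0; k < width; ++k) {
                        sum += M[row * width + k] * N[k * width + col];
                    }
                    P[row * width + col] = sum;
                }
            }
        }
    }
}
```

For Width = 23 and a tile of 16, blocks_per_dim returns 2, so a 2 x 2 grid of 16 x 16 blocks (256 threads each) covers the matrix while every block stays under the per-block limit.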

answered Sep 28 '22 04:09

Gaurav Kalra


Tags: c, cuda
