I'm looking for a bare-bones CUBLAS matrix multiplication example that multiplies M by N and places the result in P, using high-performance GPU operations, for the following code:
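const int Width = 500;  // matrix dimension; must match the array bounds below
float M[500][500], N[500][500], P[500][500];
// Fill both inputs with a constant and clear the output.
for (int i = 0; i < Width; i++) {
    for (int j = 0; j < Width; j++) {
        M[i][j] = 500;
        N[i][j] = 500;
        P[i][j] = 0;
    }
}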
So far, most code I'm finding to do any kind of matrix multiplication using CUBLAS is (seemingly?) overly complicated.
I am attempting to design a basic lab where students can compare the performance of matrix multiplication on the GPU vs matrix multiplication on the CPU, presumably with increased performance on the GPU.
A 3×3 matrix has three rows and three columns. In matrix multiplication, each row of the first matrix is paired with each column of the second matrix: the corresponding elements are multiplied, and the products are summed to give one entry of the result.
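The row-by-column rule above is also what the CPU side of the lab would compute. A minimal sketch of that baseline, assuming square matrices stored row-major in flat arrays; the name `matmul_cpu` is hypothetical and not taken from any library:

```c
#include <stddef.h>

/* Naive CPU reference: P = M * N for square n x n matrices stored
 * row-major in flat arrays (element (i, j) lives at index i * n + j). */
void matmul_cpu(const float *M, const float *N, float *P, size_t n)
{
    for (size_t i = 0; i < n; i++) {           /* row of M */
        for (size_t j = 0; j < n; j++) {       /* column of N */
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++) {
                /* multiply matching elements and accumulate */
                sum += M[i * n + k] * N[k * n + j];
            }
            P[i * n + j] = sum;
        }
    }
}
```

With the arrays from the question, this could be called as `matmul_cpu(&M[0][0], &N[0][0], &P[0][0], Width)`. Every entry of P should then be 125,000,000, since each one is the sum of 500 products of 500 × 500.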
The cuBLAS Library provides a GPU-accelerated implementation of the basic linear algebra subroutines (BLAS). cuBLAS accelerates AI and HPC applications with drop-in industry standard BLAS APIs highly optimized for NVIDIA GPUs.
The SDK contains matrixMul, which illustrates the use of CUBLAS. For a simpler example, see the CUBLAS manual section 1.3.
The matrixMul sample also shows a custom kernel, though this won't perform as well as CUBLAS, of course.
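For the 500 × 500 case in the question, a bare-bones CUBLAS version might look roughly like the sketch below. This is not the example from the manual or the SDK. It assumes the CUDA toolkit is installed, uses the `cublas_v2.h` API, and stores the matrices as flat row-major arrays on the heap instead of as `float[500][500]`. Because CUBLAS expects column-major storage, the two inputs are passed in swapped order, which yields the row-major product M × N.

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define WIDTH 500  /* matrix dimension taken from the question */

int main(void)
{
    const size_t count = (size_t)WIDTH * WIDTH;
    const size_t bytes = count * sizeof(float);

    /* Host matrices as flat row-major arrays (about 1 MB each, so heap, not stack). */
    float *M = (float *)malloc(bytes);
    float *N = (float *)malloc(bytes);
    float *P = (float *)malloc(bytes);
    if (!M || !N || !P) {
        fprintf(stderr, "host allocation failed\n");
        return 1;
    }
    for (size_t i = 0; i < count; i++) {
        M[i] = 500.0f;
        N[i] = 500.0f;
        P[i] = 0.0f;
    }

    /* Device buffers, then copy the two inputs to the GPU. */
    float *dM = NULL, *dN = NULL, *dP = NULL;
    if (cudaMalloc((void **)&dM, bytes) != cudaSuccess ||
        cudaMalloc((void **)&dN, bytes) != cudaSuccess ||
        cudaMalloc((void **)&dP, bytes) != cudaSuccess) {
        fprintf(stderr, "device allocation failed\n");
        return 1;
    }
    cudaMemcpy(dM, M, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dN, N, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasCreate failed\n");
        return 1;
    }

    /* P = alpha * M * N + beta * P. CUBLAS is column-major, so a row-major
     * matrix looks transposed to it; passing N first and M second computes
     * (M * N) transposed in column-major, which is M * N in row-major. */
    const float alpha = 1.0f, beta = 0.0f;
    cublasStatus_t status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                        WIDTH, WIDTH, WIDTH,
                                        &alpha, dN, WIDTH, dM, WIDTH,
                                        &beta, dP, WIDTH);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasSgemm failed\n");
        return 1;
    }

    /* Blocking copy back to the host; it waits for the multiplication to finish. */
    cudaMemcpy(P, dP, bytes, cudaMemcpyDeviceToHost);
    printf("P[0][0] = %.0f\n", P[0]);

    cublasDestroy(handle);
    cudaFree(dM);
    cudaFree(dN);
    cudaFree(dP);
    free(M);
    free(N);
    free(P);
    return 0;
}
```

A typical build would be something like `nvcc example.cu -lcublas`, where the file name is only a placeholder. Mathematically, each entry of P is 500 × 500 × 500 = 125,000,000, which gives a quick sanity check on the printed value. For a fair timing comparison in the lab, it is worth deciding up front whether the host-to-device copies count toward the GPU time.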
CUBLAS is not necessary to show the GPU outperforming the CPU, though CUBLAS would probably outperform it by a wider margin. It appears that many straightforward CUDA implementations (including matrix multiplication) can outperform the CPU if given a large enough data set, as explained and demonstrated here:
Simplest Possible Example to Show GPU Outperform CPU Using CUDA