I'm implementing an algorithm that, in essence, is a series of matrix-matrix multiplications like this:
Res = M1.M2.M3. ... .Mn
My matrices are really small (100x100, single-precision floats), but the sequence is really long, on the order of billions.
I tried using CUBLAS to do the matrix multiplications, but it was slow. I did, however, notice something interesting.
Multiplying a 100x100 matrix by a 100x100 matrix was slow, but multiplying a 1,000,000x100 matrix by a 100x100 matrix was relatively fast. That made me think: instead of a single scan from left to right, I could run 10,000 scans in parallel. That should be pretty fast, and since I would multiply the partial results together afterwards, I would get the same result -- just faster.
Res1    = M1 . M2 . ... . M(n/1000)
Res2    = M(n/1000 + 1) . M(n/1000 + 2) . ... . M(2n/1000)
...
Res1000 = M(999n/1000 + 1) . ... . Mn

Res = Res1 . Res2 . ... . Res1000
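The chunked scheme above can be sketched in NumPy (with a small hypothetical matrix size and chunk count for illustration; the real workload is 100x100 matrices and thousands of chunks). Matrix multiplication is associative but not commutative, so each chunk must be reduced in order and the partial results combined in order:

```python
import numpy as np

def chain_product(mats):
    """Multiply a non-empty sequence of matrices left to right."""
    res = mats[0]
    for m in mats[1:]:
        res = res @ m
    return res

def chunked_chain_product(mats, n_chunks):
    """Split the chain into contiguous chunks, reduce each chunk
    independently (these reductions could run in parallel), then
    multiply the partial results together in order."""
    chunk = -(-len(mats) // n_chunks)  # ceiling division
    partials = [chain_product(mats[i:i + chunk])
                for i in range(0, len(mats), chunk)]
    return chain_product(partials)

# Small example: 8 random 4x4 matrices, reduced in 4 chunks of 2.
rng = np.random.default_rng(0)
mats = [rng.standard_normal((4, 4)) for _ in range(8)]
assert np.allclose(chain_product(mats), chunked_chain_product(mats, 4))
```

The point of the chunking is purely to expose parallelism: every chunk's reduction is independent, so the GPU can work on all of them at once instead of serializing one small GEMM after another.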
It's worth noting that M_1 ... M_n are drawn from a set of about 100 distinct matrices, so memory consumption isn't really a problem; all I need is to be able to do multiple multiplies in one operation.
Now here is my problem. I've written a matrix-matrix (sgemm) implementation inspired by the one NVIDIA demonstrates in their documentation, but it is about 4 times slower than CUBLAS. Does anyone know how CUBLAS works? And is the code available somewhere?
Have you looked at the latest CUBLAS (version 4.1)? It includes a new batched GEMM mode specifically intended for large batches of small matrix-matrix multiplies. I would suggest the pairwise multiplication tree that Jonathan Dursi describes in his answer, but using the CUBLAS batched API to accelerate it rather than writing a custom kernel as he proposes.
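A pairwise reduction tree of the kind described here can be sketched in NumPy. Each tree level halves the chain length, and all products within one level are independent, so (hypothetically) the CUDA version could execute each whole level as a single `cublasSgemmBatched` call:

```python
import numpy as np

def pairwise_tree_product(mats):
    """Reduce an ordered matrix chain by repeatedly multiplying
    adjacent pairs. All products at one tree level are independent,
    which is exactly the shape a batched GEMM wants."""
    mats = list(mats)
    while len(mats) > 1:
        level = [mats[i] @ mats[i + 1]
                 for i in range(0, len(mats) - 1, 2)]
        if len(mats) % 2:            # odd leftover carries to next level
            level.append(mats[-1])
        mats = level
    return mats[0]

# Verify against a plain left-to-right reduction.
rng = np.random.default_rng(0)
ms = [rng.standard_normal((4, 4)) for _ in range(6)]
ref = ms[0] @ ms[1] @ ms[2] @ ms[3] @ ms[4] @ ms[5]
assert np.allclose(pairwise_tree_product(ms), ref)
```

For a chain of n matrices this takes ceil(log2(n)) levels, and because only adjacent matrices are ever paired, the non-commutative ordering of the chain is preserved.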
CUBLAS 4.1 is included with the CUDA Toolkit v4.1.