I'm implementing an algorithm that, in essence, is a series of matrix-matrix multiplications like this:
Res = M1.M2.M3. ... .Mn
My matrices are really small (100x100, single-precision floats), but the sequence is really long, on the order of billions.
I tried using CUBLAS to do the matrix multiplications, but it was slow. I did, however, notice something interesting.
Multiplying a 100x100 matrix by a 100x100 matrix was slow, but multiplying a 1,000,000x100 matrix by a 100x100 matrix was relatively fast. That made me think: instead of a single scan from left to right, I could run 10,000 scans in parallel. That should be pretty fast, and since I would multiply the partial results together afterwards, I would get the same result -- just faster.
Res1    = M1 . M2 . ... . M(n/1000)
Res2    = M(n/1000 + 1) . M(n/1000 + 2) . ... . M(2n/1000)
...
Res1000 = M(999n/1000 + 1) . ... . Mn

Res = Res1 . Res2 . ... . Res1000
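The chunked scheme above can be sketched in NumPy (with a small hypothetical matrix size and chunk count for illustration; the real workload is 100x100 matrices and thousands of chunks). Matrix multiplication is associative but not commutative, so each chunk must be reduced in order and the partial results combined in order:

```python
import numpy as np

def chain_product(mats):
    """Multiply a non-empty sequence of matrices left to right."""
    res = mats[0]
    for m in mats[1:]:
        res = res @ m
    return res

def chunked_chain_product(mats, n_chunks):
    """Split the chain into contiguous chunks, reduce each chunk
    independently (these reductions could run in parallel), then
    multiply the partial results together in order."""
    chunk = -(-len(mats) // n_chunks)  # ceiling division
    partials = [chain_product(mats[i:i + chunk])
                for i in range(0, len(mats), chunk)]
    return chain_product(partials)

# Small example: 8 random 4x4 matrices, reduced in 4 chunks of 2.
rng = np.random.default_rng(0)
mats = [rng.standard_normal((4, 4)) for _ in range(8)]
assert np.allclose(chain_product(mats), chunked_chain_product(mats, 4))
```

The point of the chunking is purely to expose parallelism: every chunk's reduction is independent, so the GPU can work on all of them at once instead of serializing one small GEMM after another.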
It's worth noting that M_1 ... M_n are drawn from a set of about 100 distinct matrices, so memory consumption isn't really a problem; all I need is to be able to do multiple multiplies in one operation.
Now here is my problem. I've written a matrix-matrix (sgemm) implementation inspired by the one NVIDIA demonstrates in their documentation, but it is about 4 times slower than CUBLAS. Does anyone know how CUBLAS works? And is the code available somewhere?
Have you looked at the latest CUBLAS (version 4.1)? It includes a new batched GEMM mode specifically intended for large batches of small matrix-matrix multiplies. I would suggest the pairwise multiplication tree that Jonathan Dursi describes in his answer, but using the CUBLAS batched API to accelerate it rather than writing a custom kernel as he proposes.
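A pairwise reduction tree of the kind described here can be sketched in NumPy. Each tree level halves the chain length, and all products within one level are independent, so (hypothetically) the CUDA version could execute each whole level as a single `cublasSgemmBatched` call:

```python
import numpy as np

def pairwise_tree_product(mats):
    """Reduce an ordered matrix chain by repeatedly multiplying
    adjacent pairs. All products at one tree level are independent,
    which is exactly the shape a batched GEMM wants."""
    mats = list(mats)
    while len(mats) > 1:
        level = [mats[i] @ mats[i + 1]
                 for i in range(0, len(mats) - 1, 2)]
        if len(mats) % 2:            # odd leftover carries to next level
            level.append(mats[-1])
        mats = level
    return mats[0]

# Verify against a plain left-to-right reduction.
rng = np.random.default_rng(0)
ms = [rng.standard_normal((4, 4)) for _ in range(6)]
ref = ms[0] @ ms[1] @ ms[2] @ ms[3] @ ms[4] @ ms[5]
assert np.allclose(pairwise_tree_product(ms), ref)
```

For a chain of n matrices this takes ceil(log2(n)) levels, and because only adjacent matrices are ever paired, the non-commutative ordering of the chain is preserved.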
CUBLAS 4.1 is included with the CUDA Toolkit v4.1.