I'm trying to use the cuBLAS functions in Anaconda's Numba package and having an issue. I need the input matrices to be in C-order. The output can be in Fortran order.
I can run the example script provided with the package, here. The script has two functions, gemm_v1 and gemm_v2. In gemm_v1, the user has to create the input matrices in Fortran order. In gemm_v2, they can be passed to the CUDA implementation of GEMM and transposed on the device. I can get these examples to work with square matrices, but I can't figure out how to get gemm_v2 to work with non-square input matrices. Is there a way to work with C-order input matrices that are non-square?
Note:
Ideally, both the input and output matrices would stay on the device after the call to GEMM, to be used in other calculations (this is part of an iterative method).
The problem with this example is that it only works for square matrices. If the matrices are not square, you cannot calculate A^t*B^t because of a dimension mismatch (assuming the dimensions were right for A*B).
I don't have a working cuBLAS installation at hand, so this is kind of a shot in the dark, but I would be really surprised if cuBLAS worked differently from the usual BLAS. BLAS expects matrices in column-major order (aka Fortran order), but it can also be used for matrices in row-major order (aka C-order).
In my opinion (which might be completely wrong), gemm_v2 is not the usual/best way to handle the multiplication of two C-order matrices, if only because one would expect the product of two C-order matrices to be a C-order matrix as well. The trick for calculating the product of two C-order matrices with the help of gemm works as follows:
Even if it is probably known to you, I would first like to elaborate on row-major order (C memory layout) and column-major order (Fortran memory layout), in order to flesh out my answer.
So if we have a 2x3 matrix A (i.e. 2 rows and 3 columns) and store it in some contiguous memory, we get:
row-major-order(A) = A11, A12, A13, A21, A22, A23
col-major-order(A) = A11, A21, A12, A22, A13, A23
That means if we get a continuous memory, which represents a matrix in the row-major-order, and interpret it as a matrix in column-major-order we will get quite a different matrix!
However, if we take a look at the transposed matrix A^t, we can easily see:
row-major-order(A) = col-major-order(A^t)
col-major-order(A) = row-major-order(A^t)
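This identity is easy to check in NumPy (a small sketch, independent of cuBLAS): raveling a C-order matrix yields the same element sequence as raveling its transpose in Fortran order, because both walk the same memory.

```python
import numpy as np

A = np.array([[11, 12, 13],
              [21, 22, 23]])  # 2x3 matrix, C-order by default

row_major_A  = A.ravel(order='C')    # A11 A12 A13 A21 A22 A23
col_major_At = A.T.ravel(order='F')  # same walk through memory

print(row_major_A)  # [11 12 13 21 22 23]
assert (row_major_A == col_major_At).all()
```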
That means, if we would like to get the matrix C in row-major order as the result, the BLAS routine should write the transposed matrix C^t in column-major order (after all, this is something we cannot change) into this very memory. However, C^t = (A*B)^t = B^t*A^t, and B^t and A^t are just the original matrices reinterpreted in column-major order.
Now, let A be an n x k matrix and B a k x m matrix; the call of the gemm routine should be as follows:
gemm('N', 'N', m, n, k, 1.0, B, m, A, k, 0.0, C, m)
Please note:
- There is no need to transpose A and B: this is handled by reinterpreting C-order as Fortran-order.
- One has to switch the places of A and B in order to get C^t in Fortran order as the result.
- The resulting matrix C is in C-order (by reinterpreting it from Fortran order to C-order, we get rid of the ^t).
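The same trick can be sketched on the CPU with SciPy's thin BLAS wrapper scipy.linalg.blas.dgemm (not cuBLAS, but the calling convention is the same): A.T and B.T are zero-copy Fortran-order views of the untouched C-order buffers, so dgemm(1.0, B.T, A.T) computes B^t*A^t = C^t in Fortran order, which reinterpreted is C in C-order.

```python
import numpy as np
from scipy.linalg.blas import dgemm

n, k, m = 2, 3, 4
A = np.arange(n * k, dtype=np.float64).reshape(n, k)  # C-order n x k
B = np.arange(k * m, dtype=np.float64).reshape(k, m)  # C-order k x m

# A.T and B.T are Fortran-order views of the same memory (no copy),
# so BLAS sees exactly the buffers we already have.
Ct = dgemm(1.0, B.T, A.T)  # B^t * A^t = (A*B)^t, returned in Fortran order
C = Ct.T                   # reinterpret as C-order: this is A*B

assert np.allclose(C, A @ B)
```

The same argument-swapping applies to a device-side gemm call: pass B first, A second, and interpret the Fortran-order output buffer as a C-order matrix.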