
How to transpose a matrix in CUDA/cublas?

Say I have an A*B matrix on the GPU, stored C-style (row-major), so that B (the number of columns) is the leading dimension. Is there any method in CUDA (or cuBLAS) to transpose this matrix to Fortran style (column-major), where A (the number of rows) becomes the leading dimension?

It would be even better if it could be transposed during the host->device transfer while keeping the original data unchanged.

asked Dec 08 '12 by Hailiang Zhang



2 Answers

As asked in the title, to transpose a device row-major matrix A[m][n], one can do it this way:

    float* clone;                       // temporary holding the same contents as A
    cudaMalloc( &clone, m * n * sizeof(float) );
    cudaMemcpy( clone, A, m * n * sizeof(float), cudaMemcpyDeviceToDevice );
    float const alpha(1.0);
    float const beta(0.0);
    cublasHandle_t handle;
    cublasCreate(&handle);
    // computes A = alpha * clone^T + beta * clone; the result overwrites A,
    // which afterwards holds the n x m transpose in row-major order
    cublasSgeam( handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, &alpha, clone, n, &beta, clone, m, A, m );
    cublasDestroy(handle);
    cudaFree(clone);
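Why a single geam call is enough: a buffer holding a row-major m x n matrix is, byte for byte, the column-major n x m transpose of that matrix. A plain-C host sketch (no GPU needed; the helper name `check_storage_duality` is my own) that verifies this index identity:

```c
#include <assert.h>

/* Row-major A[m][n]: element (i,j) sits at offset i*n + j.
   Reading the same buffer column-major with leading dimension n gives
   the n x m matrix A^T: its element (p,q) sits at p + q*n, i.e. A[q][p]. */
int check_storage_duality(void)
{
    enum { m = 2, n = 3 };
    float A[m * n] = { 1, 2, 3,
                       4, 5, 6 };      /* row-major 2 x 3 */

    /* explicit transpose, row-major n x m, built the obvious way */
    float T[n * m];
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            T[j * m + i] = A[i * n + j];

    /* reading A column-major (ld = n) at (p,q) must give T[p][q] */
    for (int p = 0; p < n; ++p)
        for (int q = 0; q < m; ++q)
            if (A[p + q * n] != T[p * m + q])
                return 0;
    return 1;
}
```

This is the duality the cublasSgeam call exploits: cuBLAS sees `clone` as the column-major transpose, transposes it back with CUBLAS_OP_T, and the column-major result it writes is the transpose when read row-major.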

And to multiply two row-major matrices A[m][k] and B[k][n], so that C = A*B:

    cublasSgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &alpha, B, n, A, k, &beta, C, n );

where C[m][n] is also row-major. This works because a row-major matrix read column-major is its transpose, so the call actually computes C^T = B^T * A^T in cuBLAS's column-major terms, which lands in memory exactly as C in row-major order.
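The operand swap can be checked on the host: a naive column-major gemm (my own `gemm_colmajor`, standing in for cublasSgemm) called with B and A swapped and the dimensions given as n, m, k reproduces the row-major product:

```c
#include <assert.h>

/* naive column-major gemm: C(m x n) = A(m x k) * B(k x n), all column-major */
void gemm_colmajor(int m, int n, int k,
                   const float *A, int lda,
                   const float *B, int ldb,
                   float *C, int ldc)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float s = 0.0f;
            for (int p = 0; p < k; ++p)
                s += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = s;
        }
}

int check_rowmajor_trick(void)
{
    enum { m = 2, k = 3, n = 2 };
    /* row-major inputs */
    float A[m * k] = { 1, 2, 3,
                       4, 5, 6 };
    float B[k * n] = { 1, 0,
                       0, 1,
                       1, 1 };
    float C[m * n];

    /* the trick: swap the operands and pass dimensions n, m, k,
       exactly as in the cublasSgemm line above */
    gemm_colmajor(n, m, k, B, n, A, k, C, n);

    /* reference: plain row-major product, element by element */
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float s = 0.0f;
            for (int p = 0; p < k; ++p)
                s += A[i * k + p] * B[p * n + j];
            if (C[i * n + j] != s)
                return 0;
        }
    return 1;
}
```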

answered Sep 18 '22 by Feng Wang

The CUDA SDK includes a matrix transpose sample; there you can see example code showing how to implement one, ranging from a naive implementation to optimized versions.

For example:

Naïve transpose

__global__ void transposeNaive(float *odata, float *idata,
                               int width, int height, int nreps)
{
    int xIndex = blockIdx.x*TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y*TILE_DIM + threadIdx.y;
    int index_in = xIndex + width * yIndex;
    int index_out = yIndex + height * xIndex;

    for (int r=0; r < nreps; r++)
    {
        for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS)
        {
          odata[index_out+i] = idata[index_in+i*width];
        }
    }
}
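The kernel's index arithmetic moves input element (row yIndex+i, column xIndex) to output position (row xIndex, column yIndex+i), i.e. a transpose; the nreps parameter just repeats the work for benchmarking. The same arithmetic can be replayed serially on the host in plain C (the loop nests stand in for the thread grid; the TILE_DIM/BLOCK_ROWS values and helper names are my own choices):

```c
#include <assert.h>

#define TILE_DIM   4
#define BLOCK_ROWS 2

/* serial re-enactment of transposeNaive: one iteration of the four
   outer loops plays the role of one thread in the grid */
void transpose_host(float *odata, const float *idata,
                    int width, int height)
{
    for (int by = 0; by < height / TILE_DIM; ++by)
      for (int bx = 0; bx < width / TILE_DIM; ++bx)
        for (int ty = 0; ty < BLOCK_ROWS; ++ty)
          for (int tx = 0; tx < TILE_DIM; ++tx) {
              int xIndex = bx * TILE_DIM + tx;
              int yIndex = by * TILE_DIM + ty;
              int index_in  = xIndex + width  * yIndex;
              int index_out = yIndex + height * xIndex;
              /* same inner loop as the kernel, minus the nreps wrapper */
              for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
                  odata[index_out + i] = idata[index_in + i * width];
          }
}

int check_transpose_naive(void)
{
    enum { width = 8, height = 4 };    /* multiples of TILE_DIM */
    float in[width * height], out[height * width];
    for (int i = 0; i < width * height; ++i)
        in[i] = (float)i;

    transpose_host(out, in, width, height);

    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (out[x * height + y] != in[y * width + x])
                return 0;
    return 1;
}
```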

As talonmies pointed out, in cuBLAS matrix operations you can specify whether each matrix should be operated on as transposed or not, e.g. in cublasDgemm(), which computes C = a * op(A) * op(B) + b * C. If you want to operate on A as transposed (A^T), you simply pass the corresponding parameter as 'T' (transposed) instead of 'N' (normal).

answered Sep 20 '22 by dreamcrash