
How to transpose a matrix in CUDA/cublas?

Say I have an A*B matrix on the GPU, stored C-style (row-major), so that B (the number of columns) is the leading dimension. Is there any method in CUDA (or cuBLAS) to transpose this matrix to Fortran style (column-major), where A (the number of rows) becomes the leading dimension?

It would be even better if it could be transposed during the host->device transfer while keeping the original data unchanged.

asked Dec 08 '12 by Hailiang Zhang



2 Answers

As asked in the title, to transpose a device row-major matrix A[m][n], one can do it this way:

    float* clone;                       // temporary holding the same contents as A
    cudaMalloc( &clone, m * n * sizeof(float) );
    cudaMemcpy( clone, A, m * n * sizeof(float), cudaMemcpyDeviceToDevice );
    float const alpha(1.0);
    float const beta(0.0);
    cublasHandle_t handle;
    cublasCreate(&handle);
    // computes A = alpha * clone^T + beta * clone; the result overwrites A,
    // which afterwards holds the n x m transpose in row-major order
    cublasSgeam( handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, &alpha, clone, n, &beta, clone, m, A, m );
    cublasDestroy(handle);
    cudaFree(clone);
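Why a single geam call is enough: a buffer holding a row-major m x n matrix is, byte for byte, the column-major n x m transpose of that matrix. A plain-C host sketch (no GPU needed; the helper name `check_storage_duality` is my own) that verifies this index identity:

```c
#include <assert.h>

/* Row-major A[m][n]: element (i,j) sits at offset i*n + j.
   Reading the same buffer column-major with leading dimension n gives
   the n x m matrix A^T: its element (p,q) sits at p + q*n, i.e. A[q][p]. */
int check_storage_duality(void)
{
    enum { m = 2, n = 3 };
    float A[m * n] = { 1, 2, 3,
                       4, 5, 6 };      /* row-major 2 x 3 */

    /* explicit transpose, row-major n x m, built the obvious way */
    float T[n * m];
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            T[j * m + i] = A[i * n + j];

    /* reading A column-major (ld = n) at (p,q) must give T[p][q] */
    for (int p = 0; p < n; ++p)
        for (int q = 0; q < m; ++q)
            if (A[p + q * n] != T[p * m + q])
                return 0;
    return 1;
}
```

This is the duality the cublasSgeam call exploits: cuBLAS sees `clone` as the column-major transpose, transposes it back with CUBLAS_OP_T, and the column-major result it writes is the transpose when read row-major.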

And to multiply two row-major matrices A[m][k] and B[k][n], so that C = A*B:

    cublasSgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &alpha, B, n, A, k, &beta, C, n );

where C[m][n] is also row-major. This works because a row-major matrix read column-major is its transpose, so the call actually computes C^T = B^T * A^T in cuBLAS's column-major terms, which lands in memory exactly as C in row-major order.
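The operand swap can be checked on the host: a naive column-major gemm (my own `gemm_colmajor`, standing in for cublasSgemm) called with B and A swapped and the dimensions given as n, m, k reproduces the row-major product:

```c
#include <assert.h>

/* naive column-major gemm: C(m x n) = A(m x k) * B(k x n), all column-major */
void gemm_colmajor(int m, int n, int k,
                   const float *A, int lda,
                   const float *B, int ldb,
                   float *C, int ldc)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float s = 0.0f;
            for (int p = 0; p < k; ++p)
                s += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = s;
        }
}

int check_rowmajor_trick(void)
{
    enum { m = 2, k = 3, n = 2 };
    /* row-major inputs */
    float A[m * k] = { 1, 2, 3,
                       4, 5, 6 };
    float B[k * n] = { 1, 0,
                       0, 1,
                       1, 1 };
    float C[m * n];

    /* the trick: swap the operands and pass dimensions n, m, k,
       exactly as in the cublasSgemm line above */
    gemm_colmajor(n, m, k, B, n, A, k, C, n);

    /* reference: plain row-major product, element by element */
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float s = 0.0f;
            for (int p = 0; p < k; ++p)
                s += A[i * k + p] * B[p * n + j];
            if (C[i * n + j] != s)
                return 0;
        }
    return 1;
}
```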

answered Sep 18 '22 by Feng Wang

The CUDA SDK includes a matrix transpose sample; there you can see example code showing how to implement one, ranging from a naive implementation to optimized versions.

For example:

Naïve transpose

__global__ void transposeNaive(float *odata, float *idata,
                               int width, int height, int nreps)
{
    int xIndex = blockIdx.x*TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y*TILE_DIM + threadIdx.y;
    int index_in = xIndex + width * yIndex;
    int index_out = yIndex + height * xIndex;

    for (int r=0; r < nreps; r++)
    {
        for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS)
        {
          odata[index_out+i] = idata[index_in+i*width];
        }
    }
}
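The kernel's index arithmetic moves input element (row yIndex+i, column xIndex) to output position (row xIndex, column yIndex+i), i.e. a transpose; the nreps parameter just repeats the work for benchmarking. The same arithmetic can be replayed serially on the host in plain C (the loop nests stand in for the thread grid; the TILE_DIM/BLOCK_ROWS values and helper names are my own choices):

```c
#include <assert.h>

#define TILE_DIM   4
#define BLOCK_ROWS 2

/* serial re-enactment of transposeNaive: one iteration of the four
   outer loops plays the role of one thread in the grid */
void transpose_host(float *odata, const float *idata,
                    int width, int height)
{
    for (int by = 0; by < height / TILE_DIM; ++by)
      for (int bx = 0; bx < width / TILE_DIM; ++bx)
        for (int ty = 0; ty < BLOCK_ROWS; ++ty)
          for (int tx = 0; tx < TILE_DIM; ++tx) {
              int xIndex = bx * TILE_DIM + tx;
              int yIndex = by * TILE_DIM + ty;
              int index_in  = xIndex + width  * yIndex;
              int index_out = yIndex + height * xIndex;
              /* same inner loop as the kernel, minus the nreps wrapper */
              for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
                  odata[index_out + i] = idata[index_in + i * width];
          }
}

int check_transpose_naive(void)
{
    enum { width = 8, height = 4 };    /* multiples of TILE_DIM */
    float in[width * height], out[height * width];
    for (int i = 0; i < width * height; ++i)
        in[i] = (float)i;

    transpose_host(out, in, width, height);

    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (out[x * height + y] != in[y * width + x])
                return 0;
    return 1;
}
```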

As talonmies pointed out, in cuBLAS matrix operations you can specify whether each matrix should be operated on as transposed or not, e.g. in cublasDgemm(), which computes C = a * op(A) * op(B) + b * C. If you want to operate on A as transposed (A^T), you simply pass the corresponding parameter as 'T' (transposed) instead of 'N' (normal).

answered Sep 20 '22 by dreamcrash