Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

CUBLAS - is matrix-element exponentiation possible?

I'm using CUBLAS (Cuda Blas libraries) for matrix operations.

Is possible to use CUBLAS to achieve the exponentiation/root mean square of a matrix items?

I mean, having the 2x2 matrix

1 4
9 16

What I want is a function to elevate to a given value e.g. 2

1 16
81 256

and calculating the root mean square e.g.

1 2
3 4

Is this possible with CUBLAS? I can't find a function suitable to this goal, but I'll ask here first to begin coding my own kernel.

like image 957
Marco A. Avatar asked Mar 27 '11 15:03

Marco A.


1 Answers

So this may well be something you do have to implement yourself, because the library won't do it for you. (There's probably some way to implement it some of it in terms of BLAS level 3 routines - certainly the squaring of the matrix elements - but it would involve expensive and otherwise unnecessary matrix-vector multiplications. And I still don't know how you'd do the squareroot operation). The reason is that these operations aren't really linear-algebra procedures; taking the square root of each matrix element doesn't really correspond to any fundamental linear algebra operation.

The good news is that these elementwise operations are very simple to implement in CUDA. Again, there are lots of tuning options one could play with for best performance, but one can get started fairly easily.

As with the matrix addition operations, you'll be treating the NxM matricies here as (N*M)-length vectors; the structure of the matrix doesn't matter for these elementwise operations. So you'll be passing in a pointer to the first element of the matrix and treating it as a single list of N*M numbers. (I'm going to assume you're using floats here, as you were talking about SGEMM and SAXPY earlier.)

The kernel, the actual bit of CUDA code which implements the operation, is quite simple. For now, each thread will compute the square (or squareroot) of one array element. (Whether this is optimal or not for performance is something you could test). So the kernels would look like the following. I'm assuming you're doing something like B_ij = (A_ij)^2; if you wanted to do the operation inplace, eg A_ij = (A_ij)^2, you could do that, too:

__global__ void squareElements(float *a, float *b, int N) {
    /* which element does this compute? */
    int tid = blockDim.x * blockIdx.x + threadIdx.x;

    /* if valid, squre the array element */
    if (tid < N) 
            b[tid] = (a[tid]*a[tid]);
}

__global__ void sqrtElements(float *a, float *b, int N) {
    /* which element does this compute? */
    int tid = blockDim.x * blockIdx.x + threadIdx.x;

    /* if valid, sqrt the array element */
    if (tid < N) 
            b[tid] = sqrt(a[tid]);   /* or sqrtf() */
}

Note that if you're ok with very slightly increased error, the 'sqrtf()' function which has maximum error of 3 ulp (units in the last place) is significantly faster.

How you call these kernels will depend on the order in which you're doing things. If you've already made some CUBLAS calls on these matricies, you'll want to use them on the arrays which are already in GPU memory.

like image 141
Jonathan Dursi Avatar answered Nov 11 '22 04:11

Jonathan Dursi