
Add scalar to vector in BLAS (cuBLAS/CUDA)

Tags: c, add, cuda, blas, cublas

I don't know if I'm just overlooking something obvious, but despite a fair amount of googling I see no way to simply add a scalar to a vector (or matrix) using BLAS operations. I'm trying to do this in cuBLAS/CUDA, so I'll take any way to accomplish this within that framework. BLAS has <t>scal for scalar multiplication (cublas<t>scal), but where is the analog for addition?! I.e., something analogous to GSL's gsl_vector_add_constant. What am I missing?

Asked Dec 27 '12 by Matt Phillips


2 Answers

Probably the only way to do what you are asking is to apply axpy with a vector of ones of the same length, scaled by the constant you want to add.

So the operation becomes X <- X + alpha * I, where I is a vector of ones, which is equivalent to adding alpha to each entry in X.
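In cuBLAS terms this is a single saxpy call. A minimal sketch follows, where handle, d_ones, and d_X are hypothetical names for an initialized cuBLAS handle, a device vector of N ones, and the target device vector; the complete example below shows how to build the ones vector:

    const float alpha = 5.0f;                           // constant to add
    cublasSaxpy(handle, N, &alpha, d_ones, 1, d_X, 1);  // X <- X + alpha * ones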


EDIT:

From the comments, it seems that you foresee some difficulty in creating the vector of ones for the SAXPY call. One way to do so is to use a memset call to set its values on the device, something like this:

#include "cuda.h"
#include "cuda_runtime_api.h"
#include "cublas_v2.h"
#include <iostream>

int main(void)
{

    const int N = 10;
    const size_t sz = sizeof(float) * size_t(N);
    float *A, *I;

    float Ah[N] = { 0., 1., 2., 3., 4., 5., 6., 7., 8., 9. };

    cudaMalloc((void **)&A, sz);
    cudaMemcpy(A, &Ah[0], sz, cudaMemcpyHostToDevice);

    // this creates a bit pattern for a single precision unity value
    // and uses 32-bit memset from the driver API to set the values in the
    // vector.
    const float one = 1.0f;
    const int* one_bits = reinterpret_cast<const int*>(&one);
    cudaMalloc((void **)&I, sz);
    cuMemsetD32(CUdeviceptr(I), *one_bits, N);

    cublasHandle_t h;
    cublasCreate(&h);

    const float alpha = 5.0f;
    cublasSaxpy(h, N, &alpha, I, 1, A, 1);

    cudaMemcpy(&Ah[0], A, sz, cudaMemcpyDeviceToHost);

    for(int i=0; i<N; i++) {
        std::cout << i << " " << Ah[i] << std::endl;
    }

    cublasDestroy(h);
    cudaDeviceReset();

    return 0;
}

Note that here I have allocated and copied memory for the cuBLAS vectors using the CUDA runtime API directly, rather than using the cuBLAS helper functions (which are only very thin wrappers around the runtime API calls anyway). The "tricky" part is making the bit pattern for a single-precision 1.0 and using a driver API memset function to set each 32-bit word of the array to it.

You could equally do the whole thing with a couple of lines of template code from the Thrust library (see the sketch after the kernel below), or just write your own kernel, which could be as simple as:

template<typename T>
__global__
void vector_add_constant( T * vector, const T scalar, int N)
{
    int tidx = threadIdx.x + blockIdx.x*blockDim.x;
    int stride = blockDim.x * gridDim.x;

    for(; tidx < N; tidx += stride) {
        vector[tidx] += scalar;
    }
}
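A launch of that kernel for the example above might look like this (a sketch; 256 threads per block is just a common choice, not a requirement):

    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;
    vector_add_constant<float><<<blocks, threads>>>(A, 5.0f, N);
    cudaDeviceSynchronize();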

[disclaimer: this kernel was written in the browser and is untested. Use at your own risk.]
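For the Thrust route mentioned above, a minimal sketch (equally untested here) could use Thrust's placeholder expressions, with the headers that ship with the CUDA toolkit:

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/functional.h>

    // copy host data to the device, then add 5 to every element in place;
    // _1 + 5.0f builds the functor via thrust::placeholders
    thrust::device_vector<float> v(Ah, Ah + N);
    using namespace thrust::placeholders;
    thrust::transform(v.begin(), v.end(), v.begin(), _1 + 5.0f);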

Answered by talonmies


Four options, ranked from best to worst:

  • Find the function you need in a different library
  • Implement the function you need yourself
  • Allocate and initialize a constant vector, and use it with *axpy (as in the answer above).
  • Although strides of zero are formally not supported by the BLAS, some implementations treat a vector with stride 0 as a "scalar" in the sense that you want. Maybe cuBLAS does. However, depending on this is a really bad idea (so bad that I strongly considered not mentioning it), as this behavior is unsupported by the BLAS; your code will not be portable, and it may even be broken by future versions of the library unless NVIDIA makes a stronger API guarantee than BLAS does. (A sketch of this unsupported pattern follows, for illustration only.)
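For illustration only, the unsupported stride-0 pattern from the last bullet would look something like this (hypothetical names; behavior is not guaranteed and may differ between cuBLAS versions):

    // UNSUPPORTED: incx == 0 is outside the BLAS specification
    float one = 1.0f, *d_one;
    cudaMalloc((void **)&d_one, sizeof(float));
    cudaMemcpy(d_one, &one, sizeof(float), cudaMemcpyHostToDevice);
    cublasSaxpy(h, N, &alpha, d_one, 0, d_X, 1);  // may work, may silently not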
Answered by Stephen Canon