I don't know if I'm just overlooking something obvious, but despite due googling around I see no way to simply add a scalar to a vector (or matrix) using BLAS operations. I'm trying to do this in cuBLAS/CUDA, so I'll take any way to accomplish it within that framework. BLAS has <t>scal for scalar multiplication (cublas<t>scal), but where is the analog for addition?! I.e. something analogous to GSL's gsl_vector_add_constant. What am I missing?
Probably the only way to do what you are asking is to apply axpy with a vector of ones of the same size as X, scaled by the constant you want to add. The operation becomes X <- X + alpha * I, where I is the all-ones vector, which is equivalent to adding alpha to each entry in X.
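As a minimal sketch of the call itself (assuming h is an initialized cuBLAS handle, X is a device array of n floats, and ones is a device array of n values all set to 1.0f; the names are placeholders):

const float alpha = 5.0f;                 // constant to add to every entry
cublasSaxpy(h, n, &alpha, ones, 1, X, 1); // X <- X + alpha * ones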
EDIT:
From the comments, it seems that you foresee some difficulty in creating the ones vector for the SAXPY call. One way to do so is to use a memset call to set the values of the ones vector on the device, something like this:
#include "cuda.h"
#include "cuda_runtime_api.h"
#include "cublas_v2.h"
#include <iostream>

int main(void)
{
    const int N = 10;
    const size_t sz = sizeof(float) * size_t(N);
    float *A, *I;
    float Ah[N] = { 0., 1., 2., 3., 4., 5., 6., 7., 8., 9. };

    cudaMalloc((void **)&A, sz);
    cudaMemcpy(A, &Ah[0], sz, cudaMemcpyHostToDevice);

    // This creates a bit pattern for a single precision unity value
    // and uses the 32-bit memset from the driver API to set each word
    // in the ones vector.
    const float one = 1.0f;
    const int* one_bits = reinterpret_cast<const int*>(&one);
    cudaMalloc((void **)&I, sz);
    cuMemsetD32(CUdeviceptr(I), *one_bits, N);

    cublasHandle_t h;
    cublasCreate(&h);
    const float alpha = 5.0f;
    cublasSaxpy(h, N, &alpha, I, 1, A, 1); // A <- A + alpha * I

    cudaMemcpy(&Ah[0], A, sz, cudaMemcpyDeviceToHost);
    for(int i=0; i<N; i++) {
        std::cout << i << " " << Ah[i] << std::endl;
    }

    cublasDestroy(h);
    cudaDeviceReset();
    return 0;
}
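If everything works, the output is the original sequence with 5 added to each entry, i.e. 0 5, 1 6, ..., 9 14. Because the code mixes the runtime and driver APIs, it must be linked against both the cuBLAS and driver libraries; the compile command would be something like nvcc -o scalar_add scalar_add.cu -lcublas -lcuda (the source file name here is just an example).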
Note that here I have allocated and copied memory for the cuBLAS vectors using the CUDA runtime API directly, rather than using the cuBLAS helper functions (which are only very thin wrappers around the runtime API calls anyway). The "tricky" part is making a bit pattern for 1.0f and using a driver API memset function to write it into each 32-bit word of the array.
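For comparison, the helper-function version of the host-to-device copy above would be cublasSetVector(N, sizeof(float), Ah, 1, A, 1), which does the same thing as the cudaMemcpy call for unit strides.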
You could equally do the whole thing with a couple of lines of template code from the Thrust library (a sketch follows below), or just write your own kernel, which could be as simple as:
template<typename T>
__global__
void vector_add_constant(T * vector, const T scalar, int N)
{
    // grid-stride loop: each thread handles every stride-th element,
    // so any launch configuration will cover all N entries
    int tidx = threadIdx.x + blockIdx.x*blockDim.x;
    int stride = blockDim.x * gridDim.x;

    for(; tidx < N; tidx += stride) {
        vector[tidx] += scalar;
    }
}
[disclaimer: this kernel was written in the browser and is untested. Use at your own risk]
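Launching that kernel on the device array from the earlier example would be a one-liner, e.g. vector_add_constant<<<64, 256>>>(A, 5.0f, N); thanks to the grid-stride loop, the exact launch dimensions are not critical. The Thrust version could look like this minimal sketch, which replaces the whole cuBLAS program above (thrust::placeholders lets you express the operation without writing a functor):

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <iostream>

int main(void)
{
    const int N = 10;
    const float alpha = 5.0f;

    // A = { 0, 1, 2, ..., 9 } in device memory
    thrust::device_vector<float> A(N);
    thrust::sequence(A.begin(), A.end());

    // add alpha to every element in place; _1 stands for each element
    using namespace thrust::placeholders;
    thrust::transform(A.begin(), A.end(), A.begin(), _1 + alpha);

    for(int i=0; i<N; i++) {
        std::cout << i << " " << A[i] << std::endl;
    }

    return 0;
}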
Four options, ranked from best to worst:
*axpy.