Something that isn't really mentioned anywhere (at least that I can see) is what library functions are exposed to inline CUDA kernels.
Specifically, I'm doing small, simple matrix multiplications that don't individually deserve to be offloaded to the GPU, but I am offloading a larger section of the algorithm that includes them. No one ever liked writing their own linear algebra routines, since someone has always done it better.
TL;DR: What libraries can I use inside inline kernels under PyCUDA?
I don't know of any, and I always thought it would be useful to have.
For the sizes of problem I usually work with (small matrices and tensors that arise in the finite element method), I just wrote C++ templates to do the operations. Templating the functions lets the compiler know the trip counts at compile time, so it can unroll loops and keep intermediate results in registers, which tends to be very efficient for kernel throughput. So the matrix-matrix product gets declared as
// Column-major indexing helper (CUBLAS-style convention): element (i,j)
// of a matrix with leading dimension ld.
__device__ __host__ inline unsigned int idx2c(unsigned int i, unsigned int j, unsigned int ld)
{
    return j * ld + i;
}

// C (l x n) = A (l x m) * B (m x n), all stored column-major.
template <typename Real, unsigned int l, unsigned int m, unsigned int n>
__device__ __host__
void matmul(const Real *a, const Real *b, Real *c)
{
    for (unsigned int i = 0; i < l; i++) {
        for (unsigned int j = 0; j < n; j++) {
            Real dotprod = Real(0);
            for (unsigned int k = 0; k < m; k++) {
                dotprod += a[idx2c(i, k, l)] * b[idx2c(k, j, m)];
            }
            c[idx2c(i, j, l)] = dotprod;
        }
    }
}
For the sorts of sizes that crop up in my kernels (2x2, 3x3, 4x4, 8x8, 9x9), doing the above and letting the compiler work things out seems to be as good as any other approach I have tried. Because CUDA is effectively scalar at the thread level, there are no vector primitives or similar features that can be used to accelerate these sorts of small operations.
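As a minimal usage sketch (the kernel and array names here are hypothetical, not from the original answer): instantiating the template with the dimensions fixed as template arguments is what lets the compiler fully unroll the loops.

// Hypothetical kernel: each thread multiplies its own pair of 3x3 matrices.
// 'as', 'bs', and 'cs' are assumed to be packed arrays of column-major
// 3x3 matrices, one per thread.
extern "C" __global__ void small_matmul_kernel(const float *as, const float *bs,
                                               float *cs, int nmats)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < nmats) {
        // Dimensions are template arguments, so the loops inside
        // matmul<float,3,3,3> are unrolled at compile time.
        matmul<float, 3, 3, 3>(as + 9 * tid, bs + 9 * tid, cs + 9 * tid);
    }
}

Under PyCUDA this would all go into the source string passed to SourceModule; the extern "C" on the __global__ wrapper keeps the kernel name unmangled so it can be looked up from Python.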