What is the fastest way to perform vector-by-vector dot products for two MxN matrices with small M in CUDA?

I have two matrices, each of which is MxN, where M = 16 and N is much larger (say N = 262144, for example). My goal is to produce a vector of length N, where the nth element is the dot product of the nth column of the first matrix with the nth column of the second.

I have attempted the following approach, where cIdx is the column index of the column vector being processed in each matrix. Not surprisingly, the NVIDIA Visual Profiler tells me that this approach is mostly memory-bandwidth bound.

    public static void MatrixDotProduct(
        float* matrix1,
        float* matrix2,
        float* dotProduct,
        int2 matrixDimensions)
    {
        // matrixDimensions.x = M (vector length), matrixDimensions.y = N (number of column vectors)
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        float sum;

        // Grid-stride loop: each thread handles every stride-th column vector
        for (int cIdx = i; cIdx < matrixDimensions.y; cIdx += stride)
        {
            int ci = cIdx * matrixDimensions.x;
            sum = 0f;

            for (int j = 0; j < matrixDimensions.x; j++)
            {
                sum += matrix1[ci + j] * matrix2[ci + j];
            }

            dotProduct[cIdx] = sum;
        }
    }

I have also found a version of a vector-by-vector dot product, which I tried to parallelize as well. Unfortunately, this ran 20% slower than the above implementation (perhaps it's because my M = 16?). Is there a better way to approach this problem that I'm missing here?

1 Answer

This isn't an easy case to optimise for because of the lack of data reuse and the small vector sizes (less than a warp). The code should be memory bound, and cache performance will be critical.
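
As a rough back-of-the-envelope check (my own figures, not measured): with M = 16 and N = 262144 the kernel reads 2 x 16 x 262144 x 4 B ≈ 33.5 MB and writes 262144 x 4 B ≈ 1 MB per call, so at an effective bandwidth of, say, 300 GB/s the best you could hope for is on the order of 0.1 ms, however the arithmetic is organised.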

One thing to try would be to reduce the volume and increase the efficiency of memory transactions by using a vector type for the loads; something like this:

__device__ float vdot(float2 v1, float2 v2) {
    return (v1.x * v2.x) + (v1.y * v2.y);
}

__device__ float vdot(float4 v1, float4 v2) {
    return (v1.x * v2.x) + (v1.y * v2.y) + (v1.z * v2.z) + (v1.w * v2.w);
}

template<typename VT, int NT>
__device__ float vector_dotprod(const VT* v1, const VT* v2) {
    float sum = 0.f;
#pragma unroll
    for (int j = 0; j < NT; j++) {
        sum += vdot(v1[j], v2[j]);
    }
    return sum;
}

// Nrows is the number of VT elements per column vector, e.g. M/2 for float2 or M/4 for float4
template<typename VT, int Nrows>
__global__
void MatrixDotProductPlus(float* matrix1, float* matrix2, float* dotProduct, int2 matrixDimensions)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    int stride2 = stride * Nrows;

    VT* m1 = reinterpret_cast<VT*>(matrix1) + i * Nrows;
    VT* m2 = reinterpret_cast<VT*>(matrix2) + i * Nrows;
    for (; i < matrixDimensions.y; i += stride, m1 += stride2, m2 += stride2) {
        dotProduct[i] = vector_dotprod<VT,Nrows>(m1, m2);
    }
}

[Warning: only very lightly tested -- use at own risk]

I would expect this to be about twice as fast as your code for the float2 case, and about four times as fast for the float4 case, on Maxwell or Pascal architectures for vectors of length 16. This assumes you know the length of the vectors at compile time and that it is a round multiple of 2 or 4, although I doubt the loop unroll is as critical as the vector type itself.
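
For completeness, here is a minimal host-side launch sketch (my addition, not part of the original answer). The pointer names d_matrix1, d_matrix2 and d_dotProduct are illustrative, and the block/grid sizes are untuned guesses; for M = 16 the float4 instantiation uses Nrows = 16 / 4 = 4:

#include <cuda_runtime.h>

int main()
{
    const int M = 16, N = 262144;
    int2 dims = make_int2(M, N);

    // cudaMalloc returns allocations aligned well beyond the 16 bytes that float4 loads require
    float *d_matrix1, *d_matrix2, *d_dotProduct;
    cudaMalloc(&d_matrix1, sizeof(float) * M * N);
    cudaMalloc(&d_matrix2, sizeof(float) * M * N);
    cudaMalloc(&d_dotProduct, sizeof(float) * N);

    // ... copy input data to d_matrix1 / d_matrix2 with cudaMemcpy ...

    const int block = 256;
    const int grid = (N + block - 1) / block;   // the grid-stride loop tolerates any grid size
    MatrixDotProductPlus<float4, M / 4><<<grid, block>>>(d_matrix1, d_matrix2, d_dotProduct, dims);
    cudaDeviceSynchronize();

    cudaFree(d_matrix1);
    cudaFree(d_matrix2);
    cudaFree(d_dotProduct);
    return 0;
}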
