I have two matrices, each of which is MxN, where M = 16 and N is much larger (say N = 262144, for example). My goal is to produce a vector of length N, where the nth element is the dot product of the nth column of one matrix with the nth column of the other.

I have attempted the following approach, where cIdx corresponds to the column index of a column vector in each matrix. Not surprisingly, NVIDIA Visual Profiler is telling me that this approach is mostly memory-bandwidth bound.
public static void MatrixDotProduct(
    float* matrix1,
    float* matrix2,
    float* dotProduct,
    int2 matrixDimensions)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float sum;

    // Grid-stride loop over columns: each thread reduces one full column per iteration.
    for (int cIdx = i; cIdx < matrixDimensions.y; cIdx += stride)
    {
        int ci = cIdx * matrixDimensions.x; // start offset of column cIdx (column-major layout)
        sum = 0f;
        for (int j = 0; j < matrixDimensions.x; j++)
        {
            sum += matrix1[ci + j] * matrix2[ci + j];
        }
        dotProduct[cIdx] = sum;
    }
}
I have also found a version of a vector-by-vector dot product, which I tried to parallelize as well. Unfortunately, this ran 20% slower than the above implementation (perhaps it's because my M = 16?). Is there a better way to approach this problem that I'm missing here?
This isn't an easy case to optimise for, because of the lack of data reuse and the small vector sizes (less than a warp). The code should be memory bound, and cache performance will be critical.
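To put rough numbers on that, using the sizes from the question: each call reads 2 x 16 x 262144 x 4 bytes = 32 MiB and writes 262144 x 4 bytes = 1 MiB, with each element touched exactly once, so the floor on runtime is roughly 33 MiB divided by whatever fraction of peak DRAM bandwidth the access pattern sustains. Note also that in your kernel each thread reads 16 consecutive floats, so on any given inner iteration the threads of a warp hit addresses 64 bytes apart; that strided pattern only runs at full efficiency if the cache absorbs it, which is part of the motivation for the wider per-thread loads below.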
One thing to try would be to reduce the volume and increase the efficiency of memory transactions by using a vector type for the loads. I would expect something like:
__device__ float vdot(float2 v1, float2 v2) {
    return (v1.x * v2.x) + (v1.y * v2.y);
}

__device__ float vdot(float4 v1, float4 v2) {
    return (v1.x * v2.x) + (v1.y * v2.y) + (v1.z * v2.z) + (v1.w * v2.w);
}

// Dot product of two column vectors stored as NT elements of vector type VT.
template<typename VT, int NT>
__device__ float vector_dotprod(const VT* v1, const VT* v2) {
    float sum = 0.f;
#pragma unroll
    for (int j = 0; j < NT; j++) {
        sum += vdot(v1[j], v2[j]);
    }
    return sum;
}
template<typename VT, int Nrows>
__global__
void MatrixDotProductPlus(float* matrix1, float* matrix2, float* dotProduct, int2 matrixDimensions)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    int stride2 = stride * Nrows;   // grid stride measured in VT elements

    // Treat each column as Nrows elements of the vector type VT.
    VT* m1 = reinterpret_cast<VT*>(matrix1) + i * Nrows;
    VT* m2 = reinterpret_cast<VT*>(matrix2) + i * Nrows;

    for (; i < matrixDimensions.y; i += stride, m1 += stride2, m2 += stride2) {
        dotProduct[i] = vector_dotprod<VT,Nrows>(m1, m2);
    }
}
[Warning: only very lightly tested -- use at own risk]
to be about twice as fast as your code for the float2 case, and about four times as fast for the float4 case, on Maxwell or Pascal architectures for vectors of length 16. This assumes you know the length of the vectors at compile time and that they are round multiples of 2 or 4, although I doubt the loop unroll is as critical as the vector type itself.
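For reference, here is a minimal host-side sketch (hypothetical names and launch configuration, untested) of how the float4 instantiation could be invoked for the M = 16 case. Each 16-float column becomes Nrows = 4 float4 elements, and pointers from cudaMalloc are sufficiently aligned for the reinterpret_cast inside the kernel:

int M = 16, N = 262144;
float *matrix1, *matrix2, *dotProduct;
cudaMalloc(&matrix1, sizeof(float) * M * N);   // cudaMalloc returns pointers aligned
cudaMalloc(&matrix2, sizeof(float) * M * N);   // to at least 256 bytes, so casting
cudaMalloc(&dotProduct, sizeof(float) * N);    // to float4* is safe here

// ... copy or generate the matrix data on the device ...

int2 dims = make_int2(M, N);                   // x = rows (M), y = columns (N)
int blockSize = 256;                           // placeholder launch configuration
int gridSize = (N + blockSize - 1) / blockSize;
MatrixDotProductPlus<float4, 4><<<gridSize, blockSize>>>(matrix1, matrix2, dotProduct, dims);
// float2 variant for the same M: MatrixDotProductPlus<float2, 8><<<gridSize, blockSize>>>(...);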