I am curious as to why multiplying a sparse matrix by a dense matrix takes a different amount of time than multiplying in the reverse order. Are the algorithms significantly different?
Here's an example in MATLAB 2018a:
M = 4000;   % also tested with M = 10000
a = sprand(M, M, 0.01);
b = rand(M);
tic; ref1 = a*b; t_axb = toc
tic; ref2 = b*a; t_bxa = toc
Here's an example with Eigen 3 and C++ using 1 thread:
//prepare acol=MxM ColMajor Eigen sparse matrix with 0.01 density
...
// M is a runtime size here, so map the raw buffer PR with Dynamic extents
Map<Matrix<double,Dynamic,Dynamic,ColMajor> > bcol(PR, M, M);
double tic,toc;
tic=getHighResolutionTime();
result=acol*bcol;
toc=getHighResolutionTime();
printf("\nacol*bcol time: %f seconds", (toc - tic));
tic=getHighResolutionTime();
result=bcol*acol;
toc=getHighResolutionTime();
printf("\nbcol*acol time: %f seconds\n", (toc - tic));
When M=4000, the results are:
t_axb =
0.6877
t_bxa =
0.4803
acol*bcol time: 0.937590 seconds
bcol*acol time: 0.532622 seconds
When M=10000, the results are:
t_axb =
11.5649
t_bxa =
9.7872
acol*bcol time: 20.140380 seconds
bcol*acol time: 8.061626 seconds
In both cases, the sparse-dense product is slower than the dense-sparse product for both MATLAB and Eigen. I am curious as to:
1. Why is this the case? Are the algorithms for sparse-dense significantly different from those for dense-sparse? The number of FLOPs is the same, right?
2. Why does Eigen match or exceed MATLAB's performance for the dense-sparse product but not for the sparse-dense product? A small difference in performance is normal, but a factor of ~1.4-1.8 seems strange given that both are highly optimized libraries. I am compiling Eigen with all the optimizations as per the documentation, i.e. -fPIC -fomit-frame-pointer -O3 -DNDEBUG -fopenmp -march=native
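(As a sanity check on the FLOP count: with b a dense M x M matrix, every nonzero of a is multiplied against M entries of b regardless of the order of the product, so flops(a*b) = flops(b*a) = 2 * nnz(a) * M ≈ 2 * 0.01 * M^3, which is about 2e10 flops either way for M = 10000.)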
Using sparse matrices to store data that contains a large number of zero-valued elements can both save a significant amount of memory and speed up the processing of that data.
For what it is worth, for random sparse matrices of size 10,000 by 10,000 vs. dense matrices of the same size, on my Xeon workstation using MATLAB and Intel MKL as the BLAS, the sparse matrix-vector multiply was faster for densities of 15% or less.
Operations using standard dense-matrix structures and algorithms are slow and inefficient when applied to large sparse matrices, as processing and memory are wasted on the zeros. Sparse data is by nature more easily compressed and thus requires significantly less storage.
A sparse matrix is one stored in a compressed format that keeps only the nonzero values (along with their indices), whereas a dense matrix explicitly stores every entry, zero and nonzero alike.
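To make the storage savings concrete for the sizes in the question (assuming a typical compressed-column layout with 8-byte double values and 4-byte integer indices; actual index widths vary by library):
M = 10000, density 0.01  ->  nnz ≈ 1e6
dense:  10000 * 10000 * 8 B               ≈ 800 MB
sparse: 1e6 * (8 B + 4 B) + 10001 * 4 B   ≈  12 MB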
You could observe the same difference by comparing column-major versus row-major storage for the sparse-matrix times vector product y = A * x. If A is row-major (equivalently, each coefficient of y can be computed independently), then each row of A can be processed in parallel without any overhead (no communication, no additional temporary, no additional operation). In contrast, if A is column-major, multi-threading cannot come for free, and in most cases the overhead is larger than the gain.
Even without multi-threading, you can see that the memory access patterns are very different:
- Row-major: multiple random read-only accesses to x, with each coefficient of y written only once.
- Column-major: each coefficient of x is read once, but we get multiple random read-write accesses to the destination y.
So even without multi-threading, the situation is naturally favorable to row-major.
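Here is a minimal sketch of the two kernels in plain C++ (hypothetical helper names, not Eigen's actual implementation), assuming standard CSR/CSC storage arrays, to make the access patterns explicit:

#include <vector>

// Row-major (CSR) y = A*x: random read-only gathers from x,
// each y[i] written exactly once; rows are independent, so a
// parallel loop over i needs no synchronization.
void csr_matvec(const std::vector<double>& val,
                const std::vector<int>& col_idx,
                const std::vector<int>& row_ptr,
                const std::vector<double>& x,
                std::vector<double>& y)
{
    const int rows = (int)row_ptr.size() - 1;
    for (int i = 0; i < rows; ++i) {
        double acc = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            acc += val[k] * x[col_idx[k]];  // random reads of x
        y[i] = acc;                         // y[i] written once
    }
}

// Column-major (CSC) y = A*x: each x[j] is read once, but y receives
// scattered read-modify-write updates; parallelizing over j would
// need atomics or per-thread copies of y. Assumes y is zero-initialized.
void csc_matvec(const std::vector<double>& val,
                const std::vector<int>& row_idx,
                const std::vector<int>& col_ptr,
                const std::vector<double>& x,
                std::vector<double>& y)
{
    const int cols = (int)col_ptr.size() - 1;
    for (int j = 0; j < cols; ++j) {
        const double xj = x[j];             // x[j] read once
        for (int k = col_ptr[j]; k < col_ptr[j + 1]; ++k)
            y[row_idx[k]] += val[k] * xj;   // random read-writes to y
    }
}

The sparse-dense matrix product in the question behaves like one such mat-vec per column of the dense factor, so the same access-pattern argument carries over.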