I have a tall matrix (the example below is 10000-by-3000) and I want to take inner products with a subset of its rows (e.g., 500 rows). This is repeated many times with different, randomly chosen rows (100 times in the example, but in reality many more). It turns out that the indexing A(sub,:) is rather slow. In my example, it is faster to multiply the full matrix A (i.e., all 10000 rows) than to select and multiply only the 500 rows that are actually needed.
The random generation of the row indices (sub = randperm(10000); sub = sub(1:500);) is computationally cheap; I put it in both loops to keep the comparison fair.
A = randn(10000,3000);
g = zeros(10000,1);

tic
for i = 1:100
    sub = randperm(10000); sub = sub(1:500);
    b = randn(3000,1);
    g(sub) = g(sub) + A(sub,:)*b;   % multiply only the 500 selected rows
end
toc
% Elapsed time is 1.58 seconds.
tic
for i = 1:100
    sub = randperm(10000); sub = sub(1:500);
    b = randn(3000,1);
    g = g + A*b;                    % full multiply, all 10000 rows
end
toc
% Elapsed time is 1.28 seconds.
The question is: is there a way to speed things up when only a subset of rows is actually needed?
Try multiplying across the rows instead of down the columns. This may require you to rearrange your data or apply a transpose (.') or two, but since columns are the native form of MATLAB arrays you may get a surprising speedup. For example, swapping the dimensions of A and g:
A = randn(3000,10000);   % A transposed: the original rows are now columns
g = zeros(1,10000);

tic
for i = 1:100
    sub = randperm(10000,500);     % taking @Dan's suggestion
    b = randn(1,3000);             % b is now a row vector
    g(sub) = g(sub) + b*A(:,sub);  % multiply across rows instead
end
toc
You can transpose the output if need be. On my computer this is over 50% faster than your first case.
I believe at least one of the underlying reasons for this is MATLAB's column-major storage: A(:,sub) copies whole contiguous columns, whereas A(sub,:) must gather scattered elements from each column, and the contiguous layout also lets BLAS/LAPACK apply optimizations such as loop unrolling more effectively.
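To see the layout effect in isolation, separate from the matrix multiply itself, you can time just the extraction step in both orientations. This is a rough sketch; the variable names are illustrative and the exact timings will vary with machine, MATLAB version, and matrix size:

A = randn(10000,3000);
At = A.';                  % transposed copy: each original row is now a contiguous column
sub = randperm(10000,500);

tic
for i = 1:100
    R = A(sub,:);          % gathers 500 scattered elements from each of 3000 columns
end
toc

tic
for i = 1:100
    C = At(:,sub);         % copies 500 contiguous 3000-element columns
end
toc

On typical hardware the column extraction At(:,sub) should be noticeably faster than the row extraction A(sub,:), because it streams contiguous memory rather than performing a strided gather.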