Given a matrix A
, I need to multiply with other n
vectors Bi
(i.e. i=1...n
). The size of A
can be like 5000x5000
and thus Bi
like 5000x1
.
If I evaluate the product in the following way:
for i=1:n
product=A*Bi;
% do something with product
end
The result is way (orders of magnitude) slower than computing the products like:
%assume that S is a matrix that contains the vectors Bi as columns, i.e. S(:,i)=Bi, then:
results=A*S; %stores all the products in matrix form
% do something with results
The problem is that the number n
of vectors Bi
may be too big to be stored in memory, for example n=300000
, so I need to use a loop approach where each time I evaluate the product, use it and then discard the vector Bi
.
Why is such an approach so slow compared to direct multiplication, and are there ways to overcome this?
You could try loop over batches so for example
for i = 0:(n/k)-1
product = A*S(:,(i*k+1):(i+1)*k)
end
And adjust k
to find the best trade off of speed and memory for you.
MATLAB's loops are slow because it is an interpreted language. So it has to work out a lot of stuff on the fly. The loops are greatly improved these days because of the JIT compiler, but they are still slow compared to the built-in functions which are written and compiled in C. Further to that, they use really cutting edge super fast matrix multiplication algorithms, as compared with your fairly naive algorithm achieved by looping, which also aid in the speed-up you are experiencing.
For simplicity my answer will assume a n-by-n square matrix A, but it is true for non squares as well.
Your loop approach uses matrix vector multiplication. The naive solution is also the best known, resulting in a runtime of O(n^2) which is repeated n times. You end up with a total runtime of O(n^3).
For matrix multiplication, there is a better approach. The best known algorithm only requires little less than O(n^2.4) runtime, making it much faster for large number.
You will achieve a better runtime when multiplying multiple vectors Bi at once using matrix multiplication. This will not achieve the performance of a pure matrix multiplication, but working with larger slices of b is probably the fastest memory efficient solution.
Some Code for the different discussed approaches:
n=5000;
k=100;
A=rand(n,n);
S=rand(n,n);
workers=matlabpool('size');
%for a parfor solution, the batch size must be smaller because multiple batches are stred in memory at once
kparallel=k/workers;
disp('simple loop:');
tic;
for i = 1:n
product = A*S(:,n);
end
toc
disp('batched loop:');
tic;
for i = 1:(n/k)
product = A*S(:,(i-1)*k+1:(i)*k);
end
toc
disp('batched parfor loop:');
tic;
parfor i = 1:(n/kparallel)
product = A*S(:,(i-1)*kparallel+1:(i)*kparallel);
end
toc
disp('matrix multiplication:');
tic;
A*S;
toc
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With