I observed that matrix-vector multiplication with a large matrix is significantly slower in R (version 3.6.1) than in Matlab (version 2019b), although both languages rely on the (same?) BLAS library. Below is a minimal example:
In Matlab:
n=900;
p=900;
A=reshape(1:(n*p),[n,p]);
x=ones(p,1);
tic()
for id = 1:1000
x = A*x;
end
toc()
In R:
n=900
p=900
A=matrix(c(1:(n*p)),nrow=n,ncol=p)
x=rep(1,ncol(A))
t0 <- Sys.time()
for(iter in 1:1000){
x = A%*%x
}
t1 <- Sys.time()
print(t1-t0)
I get an execution time of roughly 0.05 s in Matlab versus 3.5 s in R on the same computer. Any idea what causes such a difference?
Thanks.
[EDIT]: I add below a similar computation in C (using the CBLAS library; compiled with gcc cblas_dgemv.c -lblas -o cblas_dgemv, where cblas_dgemv.c denotes the source file below). I get a running time of roughly 0.08 s, which is quite close to the running time obtained using Matlab (0.05 s). I am still trying to figure out the reason for the huge running time in R (3.5 s).
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include "cblas.h"
int main(int argc, char **argv)
{
int m=900,adr;
double *A,*x,*y;
struct timeval t0,t1;
/* memory allocation and initialization */
A = (double*)malloc(m*m*sizeof(double));
x = (double*)malloc(m*sizeof(double));
y = (double*)malloc(m*sizeof(double));
for(adr=0;adr<m;adr++) x[adr] = 1.;
for(adr=0;adr<m*m;adr++) A[adr] = adr;
/* main loop */
gettimeofday(&t0, NULL);
for(adr=0;adr<1000;adr++)
cblas_dgemv(CblasColMajor,CblasNoTrans,m,m,1.,A,m,x,1,0.,y,1);
gettimeofday(&t1, NULL);
printf("elapsed time = %.2e seconds\n",(double)(t1.tv_usec-t0.tv_usec)/1000000. + (double)(t1.tv_sec-t0.tv_sec));
/* free memory */
free(A);
free(x);
free(y);
return EXIT_SUCCESS;
}
Notice that I was not able to set y=x in the cblas_dgemv routine, so this C computation is slightly different from the one done in the R and Matlab codes above. However, the compilation was done without an optimization flag (no -O3 option), and I checked that the matrix-vector product was indeed called at each iteration of the loop (performing 10x more iterations leads to a 10x longer running time).
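For completeness, the iterated update x = A*x from the R and Matlab codes can be mimicked by copying y back into x after each product. A minimal sketch, replacing the main loop above, using cblas_dcopy (the extra O(m) copy per iteration is negligible next to the O(m^2) product):
/* main loop: iterated update x = A*x, with y as a temporary buffer */
for(adr=0;adr<1000;adr++){
    cblas_dgemv(CblasColMajor,CblasNoTrans,m,m,1.,A,m,x,1,0.,y,1); /* y = A*x */
    cblas_dcopy(m,y,1,x,1); /* x = y */
}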
Because MATLAB is a programming language developed first and foremost for numerical linear algebra (matrix manipulations), it ships with libraries especially tuned for matrix multiplication.
Here's something kind of shocking:
The precompiled R distribution that is downloaded from CRAN makes use of the reference BLAS/LAPACK implementation for linear algebra operations
The "reference BLAS" is the non-optimized, non-accelerated BLAS, unlike OpenBLAS or Intel MKL. Matlab uses MKL, which is accelerated.
This seems to be confirmed by sessionInfo() in my R 3.6.0 on macOS:
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
If I'm reading this right, that means that by default, R uses a slow BLAS, and if you want it to go fast, you need to do some configuration to get it to use a fast BLAS instead.
This is a little surprising to me. As I understand it, the reference BLAS is generally used for testing and development, not for "actual work".
I get about the same timings as you in R 3.6 vs Matlab R2019b on macOS 10.14: 0.04 seconds in Matlab, 4.5 seconds in R. I think that's consistent with R using a non-accelerated BLAS.
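For reference, the R for macOS FAQ describes a way to point a CRAN build of R at Apple's accelerated BLAS (the Accelerate framework). A sketch, assuming a standard R 3.6 framework install; double-check the FAQ for your version, since the paths may differ:
# swap R's reference BLAS for Apple's Accelerate BLAS
cd /Library/Frameworks/R.framework/Resources/lib
ln -sf /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib libRblas.dylib
# to revert to the shipped reference BLAS
ln -sf libRblas.0.dylib libRblas.dylib
On Linux, the usual route is to install an optimized BLAS such as OpenBLAS and select it through the distribution's alternatives mechanism, or to build R from source with --with-blas.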
I used Rcpp to interface a C++ implementation of the matrix-vector product with R.
Source C++ file ('cblas_dgemv2.cpp'): performs the product y = A*x 'niter' times.
#include <cblas-openblas.h>
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
Rcpp::NumericVector cblas_dgemv2(Rcpp::NumericMatrix A, Rcpp::NumericVector x, int niter)
{
int m=A.ncol(),iter;
Rcpp::NumericVector y(m);
for(iter=0;iter<niter;iter++)
cblas_dgemv(CblasColMajor,CblasNoTrans,m,m,1.,A.begin(),m,x.begin(),1,0.,y.begin(),1);
return y;
}
Then I perform two experiments using the R code below: first, a single call y=cblas_dgemv2(A,x,1000), which computes the product y=A*x 1000 times in C++; second, 1000 calls y=cblas_dgemv2(A,x,1), each computing the product y=A*x once in C++.
# compile cblas_dgemv2 (you may need to update the path to the library)
PKG_LIBS <- '/usr/lib/x86_64-linux-gnu/libopenblas.a'
PKG_CPPFLAGS <- '-I/usr/include/x86_64-linux-gnu'
Sys.setenv(PKG_LIBS = PKG_LIBS , PKG_CPPFLAGS = PKG_CPPFLAGS)
Rcpp::sourceCpp('cblas_dgemv2.cpp', rebuild = TRUE)
# create A and x
n=900
A=matrix(c(1:(n*n)),nrow=n,ncol=n)
x=rep(1,n)
# Experiment 1: call 1 time cblas_dgemv2 (with 1000 iterations in C++)
t0 <- Sys.time()
y=cblas_dgemv2(A,x,1000) # perform 1000 times the computation y = A%*%x
t1 <- Sys.time()
print(t1-t0)
# Experiment 2: call 1000 times cblas_dgemv2
t0 <- Sys.time()
for(iter in 1:1000){
y=cblas_dgemv2(A,x,1) # perform the computation y = A%*%x once
}
t1 <- Sys.time()
print(t1-t0)
I get a running time of 0.08s for the first experiment versus 4.8s for the second experiment.
My conclusion: the bottleneck in terms of running time comes from the data transfers between R and C++, not from the computation of the matrix-vector product itself. Surprising, isn't it?
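One way to tighten this comparison (a sketch, assuming the microbenchmark package is installed, with A, x and cblas_dgemv2 as defined above):
# compare one call doing 1000 products against one call doing a single product
library(microbenchmark)
microbenchmark(
  single = cblas_dgemv2(A, x, 1),    # 1 product + 1 R/C++ round trip
  batch  = cblas_dgemv2(A, x, 1000), # 1000 products + 1 R/C++ round trip
  times = 50
)
If the transfer dominates, 'single' should cost nearly as much per call as 'batch', despite doing 1000 times less arithmetic.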