Large performance differences between OS for matrix computation

Tags:

r

On my two computers, I tried this code:

N <- 10e3
M <- 2000
X <- matrix(rnorm(N * M), N)
system.time(crossprod(X))

The first one is a standard laptop and this operation takes 1.7 sec.

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

The second one is a rather good desktop computer and it took 17 seconds.

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

The desktop is the more powerful machine, yet it takes 10 times longer for this matrix computation.

Is the problem coming from the default BLAS/LAPACK used?

asked Jun 14 '18 by F. Privé

2 Answers

TL;DR: CentOS uses single-threaded OpenBLAS; Linux Mint uses the reference BLAS by default but can use other BLAS versions.

The R packages for CentOS available from EPEL depend on openblas-Rblas. This seems to be an OpenBLAS build providing BLAS for R. So while it looks like R's BLAS is used, it actually is OpenBLAS. The LAPACK version is always the one provided by R.

On Debian and derived distributions like Mint, r-base-core depends on

  • libblas3 | libblas.so.3
  • liblapack3 | liblapack.so.3

By default these are provided by the reference implementations libblas3 and liblapack3. These are not particularly fast, but you can replace them easily by installing packages like libopenblas-base. You have control over the BLAS and LAPACK used on your system via update-alternatives.
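To see which libraries a given R session actually picked up (for example after switching with update-alternatives and restarting R), you can inspect sessionInfo(); the OpenBLAS path in the comment is just an illustrative example:

```r
# sessionInfo() reports the BLAS/LAPACK shared libraries in R >= 3.4
si <- sessionInfo()
cat("BLAS:  ", si$BLAS, "\n")
cat("LAPACK:", si$LAPACK, "\n")
# After selecting OpenBLAS via update-alternatives, you would see a path
# like /usr/lib/openblas-base/libblas.so.3 here instead of the reference BLAS.
```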

For controlling the number of threads with OpenBLAS I normally use RhpcBLASctl:

N <- 20000
M <- 2000
X <- matrix(rnorm(N * M), N)
RhpcBLASctl::blas_set_num_threads(2)
system.time(crossprod(X))
#>    user  system elapsed 
#>   2.492   0.331   1.339
RhpcBLASctl::blas_set_num_threads(1)
system.time(crossprod(X))
#>    user  system elapsed 
#>   2.319   0.052   2.316

For some reason, setting the environment variables OPENBLAS_NUM_THREADS, GOTO_NUM_THREADS, or OMP_NUM_THREADS from within R does not have the desired effect. On CentOS even RhpcBLASctl does not help, since the OpenBLAS build used there is single-threaded.

answered Nov 18 '22 by Ralf Stubner

R is distributed with a default BLAS implementation, but it may not be optimized for your computer. Installing an optimized BLAS such as ATLAS or OpenBLAS before building R, as advised in R's installation guide, is the way to go. If you click on Download R for Linux and then on debian/, it says:

You may want to install the automatically tuned Atlas or the multi-threaded OpenBlas library in order to get higher performance for linear algebra operations

The source of R can be downloaded here, and the BLAS implementation is located in R-3.5.0/src/extra/blas. For instance, the Fortran source code of the matrix-matrix multiplication dgemm is located in blas.f, along with most other BLAS routines (in a single file!).
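As a quick sanity check of what crossprod() asks the BLAS to compute (X'X, dispatched to compiled BLAS routines rather than R-level loops):

```r
set.seed(1)
X <- matrix(rnorm(20), nrow = 4)      # small 4x5 example matrix
all.equal(crossprod(X), t(X) %*% X)   # TRUE: both compute X'X
```

The difference is that crossprod() computes X'X in one BLAS call without materializing t(X), which is why it is the idiomatic way to write this product in R.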

The function's header comment specifies:

 -- Written on 8-February-1989.
 Jack Dongarra, Argonne National Laboratory.
 Iain Duff, AERE Harwell.
 Jeremy Du Croz, Numerical Algorithms Group Ltd.
 Sven Hammarling, Numerical Algorithms Group Ltd.

The very same lines can be found in the netlib implementation of the routine dgemm.

By contrast, OpenBLAS provides different implementations, one for each kind of processor. See for instance this file devoted to dgemm on the Haswell microarchitecture. There are calls to prefetcht0 for prefetching and to vfmadd231pd, a vectorized FMA SIMD instruction that performs the double-precision operation d = a*b + c four times at once.

Using an optimized BLAS can save the day. See for instance this benchmark, where Netlib's dgemm() takes 64 seconds while the MKL, OpenBLAS, or ATLAS dgemm takes less than 4 seconds.
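You can reproduce this kind of comparison on your own machine by timing the same product under each BLAS (install one, restart R, rerun). The matrix sizes here are my own choice, smaller than the question's, so the reference BLAS finishes quickly too:

```r
# Time X'X once under whichever BLAS is currently linked; rerun after
# switching BLAS (e.g. via update-alternatives) to compare elapsed times.
set.seed(42)
N <- 5000
M <- 1000
X <- matrix(rnorm(N * M), N)
t_elapsed <- system.time(crossprod(X))["elapsed"]
cat("crossprod elapsed:", t_elapsed, "seconds\n")
```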

R's internal BLAS might be even slower than the classic Netlib library. Indeed, as specified in appendix A.3.1.5 Shared BLAS of R Installation and Administration:

R offers the option of compiling the BLAS into a dynamic library libRblas stored in R_HOME/lib and linking both R itself and all the add-on packages against that library. .... There may be performance disadvantages in using a shared BLAS. ... However, experiments showed that in many cases using a shared BLAS was as fast, provided high levels of compiler optimization are used.

Looking at R's config.site file, the optimization level is '-O2' for g77/gfortran. Therefore, tuning the FFLAGS option could prove useful if the Fortran compiler is not g77/gfortran. During the configure step, there should be a line like checking whether we are using the GNU Fortran 77 compiler... yes (line 7521 of the configure file).

answered Nov 18 '22 by francis