
Why numpy/scipy is faster without OpenBLAS?

I made two installations:

  1. brew install numpy (and scipy) --with-openblas
  2. Cloned the Git repositories (for numpy and scipy) and built them myself

Afterwards I cloned two handy scripts for verifying these libraries in a multi-threaded environment:

git clone https://gist.github.com/3842524.git

Then, for each installation, I ran show_config:

python -c "import scipy as np; np.show_config()"

Everything looks fine for installation 1:

lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/opt/openblas/lib']
    language = f77
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/opt/openblas/lib']
    language = f77
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/opt/openblas/lib']
    language = f77
blas_mkl_info:
    NOT AVAILABLE

But for installation 2 things are not so bright:

lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3']
    define_macros = [('NO_ATLAS_INFO', 3)]
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
    define_macros = [('NO_ATLAS_INFO', 3)]

So it seems I failed to link OpenBLAS correctly. That's fine for now; here are the performance results. All tests were performed on an iMac running Yosemite, with an i7-4790K (4 cores, hyper-threaded).

First installation with OpenBLAS:

numpy:

OMP_NUM_THREADS=1 python test_numpy.py
FAST BLAS
version: 1.9.2
maxint: 9223372036854775807
dot: 0.126578998566 sec

OMP_NUM_THREADS=2 python test_numpy.py
FAST BLAS
version: 1.9.2
maxint: 9223372036854775807
dot: 0.0640147686005 sec

OMP_NUM_THREADS=4 python test_numpy.py
FAST BLAS
version: 1.9.2
maxint: 9223372036854775807
dot: 0.0360922336578 sec

OMP_NUM_THREADS=8 python test_numpy.py
FAST BLAS
version: 1.9.2
maxint: 9223372036854775807
dot: 0.0364527702332 sec

scipy:

OMP_NUM_THREADS=1 python test_scipy.py
cholesky: 0.0276656150818 sec
svd: 0.732437372208 sec

OMP_NUM_THREADS=2 python test_scipy.py
cholesky: 0.0182101726532 sec
svd: 0.441690778732 sec

OMP_NUM_THREADS=4 python test_scipy.py
cholesky: 0.0130400180817 sec
svd: 0.316107988358 sec

OMP_NUM_THREADS=8 python test_scipy.py
cholesky: 0.012854385376 sec
svd: 0.315939807892 sec

Second installation without OpenBLAS:

numpy:

OMP_NUM_THREADS=1 python test_numpy.py
slow blas
version: 1.10.0.dev0+3c5409e
maxint: 9223372036854775807
dot: 0.0371072292328 sec

OMP_NUM_THREADS=2 python test_numpy.py
slow blas
version: 1.10.0.dev0+3c5409e
maxint: 9223372036854775807
dot: 0.0215149879456 sec

OMP_NUM_THREADS=4 python test_numpy.py
slow blas
version: 1.10.0.dev0+3c5409e
maxint: 9223372036854775807
dot: 0.0146862030029 sec

OMP_NUM_THREADS=8 python test_numpy.py
slow blas
version: 1.10.0.dev0+3c5409e
maxint: 9223372036854775807
dot: 0.0141334056854 sec

scipy:

OMP_NUM_THREADS=1 python test_scipy.py
cholesky: 0.0109382152557 sec
svd: 0.32529540062 sec

OMP_NUM_THREADS=2 python test_scipy.py
cholesky: 0.00988121032715 sec
svd: 0.331357002258 sec

OMP_NUM_THREADS=4 python test_scipy.py
cholesky: 0.00916676521301 sec
svd: 0.318637990952 sec

OMP_NUM_THREADS=8 python test_scipy.py
cholesky: 0.00931282043457 sec
svd: 0.324427986145 sec

To my surprise, the second installation is faster than the first. For scipy, adding more cores gives no increase in performance, yet even a single core is faster than 4 cores with OpenBLAS.

Does anyone have an idea why that is?
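To put numbers on that comparison, here is a small sketch that computes the ratios from the dot timings reported above. The names install_1_dot, install_2_dot and speedup are made up for this illustration; the values are copied from the output and nothing is re-measured.

```python
# dot timings in seconds, copied from the output above, keyed by OMP_NUM_THREADS
install_1_dot = {1: 0.126578998566, 2: 0.0640147686005,
                 4: 0.0360922336578, 8: 0.0364527702332}
install_2_dot = {1: 0.0371072292328, 2: 0.0215149879456,
                 4: 0.0146862030029, 8: 0.0141334056854}


def speedup(timings, threads):
    """Speedup of a run with `threads` threads relative to the 1-thread run."""
    return timings[1] / timings[threads]


for n in sorted(install_1_dot):
    # How many times slower installation 1 is than installation 2
    ratio = install_1_dot[n] / install_2_dot[n]
    print("threads=%d  install 1 speedup=%.2f  install 2 speedup=%.2f  "
          "install 1 / install 2=%.2f"
          % (n, speedup(install_1_dot, n), speedup(install_2_dot, n), ratio))
```

Both installations scale with the thread count for dot; the gap between them is already there at one thread.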

bigdatarefiner asked Jan 09 '23 13:01


1 Answer

There are two obvious differences that might account for the discrepancy:

  1. You are comparing two different versions of numpy. The OpenBLAS-linked version you installed using Homebrew is 1.9.2, whereas the one you built from source is 1.10.0.dev0+3c5409e.

  2. Whilst the newer version is not linked against OpenBLAS, it is linked against Apple's Accelerate Framework, a different optimized BLAS implementation.


Your test script still reports slow blas for the second case because of an incompatibility with the newest versions of numpy. The script you are using tests whether numpy is linked against an optimised BLAS library by checking for the presence of numpy.core._dotblas:

try:
    import numpy.core._dotblas
    print 'FAST BLAS'
except ImportError:
    print 'slow blas'

In older versions of numpy, this C module would only be compiled during the installation process if an optimized BLAS library was found. However, _dotblas has been removed altogether as of the 1.10.0 development versions (as mentioned in this previous SO question), so the script will always report slow blas for these versions.
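A check that does not rely on _dotblas can look at the build configuration that show_config() prints instead. The following is only a minimal sketch of that idea: classify_blas is a hypothetical helper, not part of numpy, and it works on a plain dict shaped like the blas_opt_info sections in the question, so it runs without numpy installed.

```python
def classify_blas(info):
    """Guess which BLAS a build-config section refers to.

    `info` is a dict shaped like the blas_opt_info output in the question.
    Hypothetical helper for illustration; not part of numpy.
    """
    libraries = [lib.lower() for lib in info.get("libraries", [])]
    link_args = " ".join(info.get("extra_link_args", [])).lower()

    if any("openblas" in lib for lib in libraries):
        return "OpenBLAS"
    if any("mkl" in lib for lib in libraries):
        return "MKL"
    # Accelerate is linked as a framework, so it shows up in the linker
    # arguments rather than in the list of libraries.
    if "accelerate" in link_args or "veclib" in link_args:
        return "Accelerate"
    if any("atlas" in lib for lib in libraries):
        return "ATLAS"
    return "unknown"


# The two blas_opt_info sections from the question, written out as dicts
homebrew_build = {
    "libraries": ["openblas", "openblas"],
    "library_dirs": ["/usr/local/opt/openblas/lib"],
    "language": "f77",
}
source_build = {
    "extra_link_args": ["-Wl,-framework", "-Wl,Accelerate"],
    "extra_compile_args": ["-msse3"],
    "define_macros": [("NO_ATLAS_INFO", 3)],
}

print(classify_blas(homebrew_build))
print(classify_blas(source_build))
```

On the numpy versions discussed here, I believe the same dict was exposed as numpy.__config__.blas_opt_info, but treat that attribute name as an assumption to verify against your own build.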

I've written an updated version of the numpy test script that reports the BLAS linkage correctly for the latest versions; you can find it here.

ali_m answered Jan 14 '23 15:01