Why is it that the matrix multiplication with Numpy is much faster than gsl_blas_sgemm
from GSL, for instance:
import numpy as np
import time
N = 1000
M = np.zeros(shape=(N, N), dtype=np.float64)
for i in range(N):
    for j in range(N):
        M[i, j] = 0.23 + 100*i + j
tic = time.time()
np.matmul(M, M)
toc = time.time()
print(toc - tic)
gives something between 0.017 and 0.019 seconds, while in C++:
#include <chrono>
#include <iostream>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_blas.h>
using namespace std::chrono;
int main(void) {
    int N = 1000;
    gsl_matrix_float* M = gsl_matrix_float_alloc(N, N);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            gsl_matrix_float_set(M, i, j, 0.23 + 100 * i + j);
        }
    }
    gsl_matrix_float* C = gsl_matrix_float_alloc(N, N); // save the result into C
    auto start = high_resolution_clock::now();
    gsl_blas_sgemm(CblasNoTrans, CblasNoTrans, 1.0, M, M, 0.0, C);
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>(stop - start);
    std::cout << duration.count() << std::endl;
    return 0;
}
I get a runtime of about 2.7 seconds for the multiplication, even though I am compiling with the maximum speed option /O2 in Visual Studio. I must be doing something very wrong. I was not expecting the C++ code to perform much better, because I am aware that Numpy is optimized C code, but I was also not expecting it to be about 150 times slower than Python. Why is that? How can I improve the runtime of the multiplication relative to Numpy?
Background of the problem: I need to evaluate a 1000- to 2000-dimensional integral, and I am doing it with the Monte Carlo method. For that I wrote almost the whole integrand as Numpy array operations. This works quite fast, but I need it even faster in order to evaluate the same integrand 100,000 to 500,000 times, so any little improvement would help. Does it make sense to write the same code in C/C++, or should I stick to Numpy? Thanks!
1 Answer
TL;DR: the C++ code and Numpy do not use the same matrix-multiplication library.
The matrix multiplication of the GSL library is not optimized. On my machine, it runs sequentially, does not use SIMD instructions (SSE/AVX), and does not unroll the loops to perform register tiling. I also suspect it does not use the CPU cache efficiently due to the lack of tiling. These optimizations are critical to achieving high performance and are widely used in fast linear algebra libraries.
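To make the tiling point concrete, here is a minimal illustrative sketch (not the actual GSL or BLAS source) of a naive triple loop versus a cache-blocked variant; the block size BS = 64 is an arbitrary value chosen for illustration:

#include <cstddef>

// Naive multiplication C += A * B for row-major N x N matrices.
// Every element of B is re-read from main memory many times, so the CPU
// cache is used poorly and the compiler has little room for register tiling.
void matmul_naive(const float* A, const float* B, float* C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < N; ++k)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

// Cache-blocked (tiled) variant: the same arithmetic, but performed on
// BS x BS tiles so the working set fits in the L1/L2 cache. Real BLAS
// kernels add SIMD, register tiling, prefetching and threading on top
// of this idea.
void matmul_blocked(const float* A, const float* B, float* C, std::size_t N) {
    const std::size_t BS = 64;
    for (std::size_t ii = 0; ii < N; ii += BS)
        for (std::size_t kk = 0; kk < N; kk += BS)
            for (std::size_t jj = 0; jj < N; jj += BS)
                for (std::size_t i = ii; i < ii + BS && i < N; ++i)
                    for (std::size_t k = kk; k < kk + BS && k < N; ++k)
                        for (std::size_t j = jj; j < jj + BS && j < N; ++j)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}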
Numpy uses a BLAS library installed on your machine. On many Linux platforms, it uses OpenBLAS or the Intel MKL. Both are very fast (they use all the optimizations described above) and run in parallel.
You can check which BLAS implementation Numpy uses with numpy.show_config(). On my Linux machine, Numpy uses CBLAS by default, which internally uses OpenBLAS (OpenBLAS is strangely not directly detected by Numpy).
There are many fast parallel BLAS implementations (GotoBLAS, ATLAS, BLIS, etc.). The open-source BLIS library is great because its matrix multiplication is very fast on many different architectures.
As a result, the simplest way to improve your C++ code is to call the cblas_sgemm CBLAS function and link against a fast BLAS library such as OpenBLAS or BLIS.
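For example, the GSL call in the question could be replaced by a direct CBLAS call along these lines. This is a minimal sketch assuming row-major square matrices and that you link against OpenBLAS (or another CBLAS provider), e.g. with -lopenblas on Linux or the OpenBLAS import library in Visual Studio:

#include <cblas.h>   // CBLAS interface, provided by OpenBLAS, BLIS, MKL, ...
#include <vector>

int main() {
    const int N = 1000;
    std::vector<float> M(N * N), C(N * N);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            M[i * N + j] = 0.23f + 100.0f * i + j;

    // C = 1.0 * M * M + 0.0 * C in single precision, row-major layout.
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,
                1.0f, M.data(), N,
                      M.data(), N,
                0.0f, C.data(), N);
    return 0;
}

The same call works with any CBLAS-compatible backend, so you can switch between OpenBLAS, BLIS or the MKL simply by changing the library you link against.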
One simple way to see how poorly GSL performs is to use a profiler (like perf on Linux or VTune on Windows). In your case, Linux perf reports that >99% of the time is spent in libgslcblas.so (i.e. the GSL library). More specifically, most of the execution time is spent in the following assembly loop:
250: movss (%rdx),%xmm1
add $0x4,%rax
add $0x4,%rdx
mulss %xmm2,%xmm1 # scalar instructions
addss -0x4(%rax),%xmm1
movss %xmm1,-0x4(%rax)
cmp %rax,%r9
↑ jne 250
As for Numpy, 99% of its time is spent in libopenblasp-r0.3.13.so (i.e. the OpenBLAS library). More specifically, in the following assembly code of the function dgemm_kernel_HASWELL:
110: lea 0x80(%rsp),%rsi
add $0x60,%rsi
mov %r12,%rax
sar $0x3,%rax
cmp $0x2,%rax
↓ jl d26
prefetcht0 0x200(%rdi) # Data prefetching
vmovups -0x60(%rsi),%ymm1
prefetcht0 0xa0(%rsi)
vbroadcastsd -0x80(%rdi),%ymm0 # Fast SIMD instruction (AVX)
prefetcht0 0xe0(%rsi)
vmovups -0x40(%rsi),%ymm2
prefetcht0 0x120(%rsi)
vmovups -0x20(%rsi),%ymm3
vmulpd %ymm0,%ymm1,%ymm4
prefetcht0 0x160(%rsi)
vmulpd %ymm0,%ymm2,%ymm8
vmulpd %ymm0,%ymm3,%ymm12
prefetcht0 0x1a0(%rsi)
vbroadcastsd -0x78(%rdi),%ymm0
vmulpd %ymm0,%ymm1,%ymm5
vmulpd %ymm0,%ymm2,%ymm9
[...]
We can clearly see that the GSL code is not optimized (because of the scalar code and the naive simple loop) and that the OpenBLAS code is optimized, as it uses at least wide SIMD instructions, data prefetching and loop unrolling. Note that the executed OpenBLAS code is not optimal, as it could use the FMA instructions available on my processor.