Why is it that the matrix multiplication with Numpy is much faster than gsl_blas_sgemm
from GSL, for instance:
import numpy as np
import time
N = 1000
M = np.zeros(shape=(N, N), dtype=np.float64)
for i in range(N):
    for j in range(N):
        M[i, j] = 0.23 + 100*i + j
tic = time.time()
np.matmul(M, M)
toc = time.time()
print(toc - tic)
gives something between 0.017 and 0.019 seconds, while in C++:
#include <chrono>
#include <iostream>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_blas.h>
using namespace std::chrono;
int main(void) {
    int N = 1000;
    gsl_matrix_float* M = gsl_matrix_float_alloc(N, N);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            gsl_matrix_float_set(M, i, j, 0.23 + 100 * i + j);
        }
    }
    gsl_matrix_float* C = gsl_matrix_float_alloc(N, N); // save the result into C
    auto start = high_resolution_clock::now();
    gsl_blas_sgemm(CblasNoTrans, CblasNoTrans, 1.0, M, M, 0.0, C);
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>(stop - start);
    std::cout << duration.count() << std::endl;
    return 0;
}
I get a runtime of about 2.7 seconds for the multiplication, even though I am compiling with the maximum speed option /O2 in Visual Studio. I must be doing something very wrong. I was not expecting the C++ code to perform much better, because I am aware that Numpy is optimized C code, but I was also not expecting it to be about 150 times slower than Python. Why is that? How can I improve the runtime of the multiplication relative to Numpy?
Background of the problem: I need to evaluate a 1000- to 2000-dimensional integral, and I am doing it with the Monte Carlo method. For that I wrote almost the whole integrand as Numpy array operations. This works quite fast, but I need it even faster in order to evaluate the same integrand 100,000 to 500,000 times, so any little improvement would help. Does it make sense to write the same code in C/C++, or should I stick to Numpy? Thanks!
1 Answer
TL;DR: the C++ code and Numpy do not use the same matrix-multiplication library.
The matrix multiplication of the GSL library is not optimized. On my machine, it runs sequentially, does not use SIMD instructions (SSE/AVX), and does not unroll the loops to perform register tiling. I also suspect it does not use the CPU cache efficiently due to the lack of tiling. These optimizations are critical to achieving high performance and are widely used in fast linear algebra libraries.
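To make the tiling point concrete, here is a minimal illustrative sketch (not the actual GSL or BLAS source) of a naive triple loop versus a cache-blocked variant; the block size BS = 64 is an arbitrary value chosen for illustration:

#include <cstddef>

// Naive multiplication C += A * B for row-major N x N matrices.
// Every element of B is re-read from main memory many times, so the CPU
// cache is used poorly and the compiler has little room for register tiling.
void matmul_naive(const float* A, const float* B, float* C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < N; ++k)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

// Cache-blocked (tiled) variant: the same arithmetic, but performed on
// BS x BS tiles so the working set fits in the L1/L2 cache. Real BLAS
// kernels add SIMD, register tiling, prefetching and threading on top
// of this idea.
void matmul_blocked(const float* A, const float* B, float* C, std::size_t N) {
    const std::size_t BS = 64;
    for (std::size_t ii = 0; ii < N; ii += BS)
        for (std::size_t kk = 0; kk < N; kk += BS)
            for (std::size_t jj = 0; jj < N; jj += BS)
                for (std::size_t i = ii; i < ii + BS && i < N; ++i)
                    for (std::size_t k = kk; k < kk + BS && k < N; ++k)
                        for (std::size_t j = jj; j < jj + BS && j < N; ++j)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}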
Numpy uses a BLAS library installed on your machine. On many Linux platforms, it uses OpenBLAS or the Intel MKL. Both are very fast (they use all the optimizations described above) and run in parallel.
You can check which BLAS implementation Numpy uses with numpy.show_config(). On my Linux machine, Numpy uses CBLAS by default, which internally uses OpenBLAS (OpenBLAS is strangely not directly detected by Numpy).
There are many fast parallel BLAS implementations (GotoBLAS, ATLAS, BLIS, etc.). The open-source BLIS library is great because its matrix multiplication is very fast on many different architectures.
As a result, the simplest way to improve your C++ code is to call the cblas_sgemm CBLAS function and link against a fast BLAS library such as OpenBLAS or BLIS.
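For example, the GSL call in the question could be replaced by a direct CBLAS call along these lines. This is a minimal sketch assuming row-major square matrices and that you link against OpenBLAS (or another CBLAS provider), e.g. with -lopenblas on Linux or the OpenBLAS import library in Visual Studio:

#include <cblas.h>   // CBLAS interface, provided by OpenBLAS, BLIS, MKL, ...
#include <vector>

int main() {
    const int N = 1000;
    std::vector<float> M(N * N), C(N * N);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            M[i * N + j] = 0.23f + 100.0f * i + j;

    // C = 1.0 * M * M + 0.0 * C in single precision, row-major layout.
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,
                1.0f, M.data(), N,
                      M.data(), N,
                0.0f, C.data(), N);
    return 0;
}

The same call works with any CBLAS-compatible backend, so you can switch between OpenBLAS, BLIS or the MKL simply by changing the library you link against.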
One simple way to see how poorly GSL performs is to use a profiler (like perf on Linux or VTune on Windows). In your case, Linux perf reports that >99% of the time is spent in libgslcblas.so (i.e. the GSL library). More specifically, most of the execution time is spent in the following assembly loop:
250: movss (%rdx),%xmm1
add $0x4,%rax
add $0x4,%rdx
mulss %xmm2,%xmm1 # scalar instructions
addss -0x4(%rax),%xmm1
movss %xmm1,-0x4(%rax)
cmp %rax,%r9
↑ jne 250
As for Numpy, 99% of its time is spent in libopenblasp-r0.3.13.so (i.e. the OpenBLAS library). More specifically, in the following assembly code of the function dgemm_kernel_HASWELL:
110: lea 0x80(%rsp),%rsi
add $0x60,%rsi
mov %r12,%rax
sar $0x3,%rax
cmp $0x2,%rax
↓ jl d26
prefetcht0 0x200(%rdi) # Data prefetching
vmovups -0x60(%rsi),%ymm1
prefetcht0 0xa0(%rsi)
vbroadcastsd -0x80(%rdi),%ymm0 # Fast SIMD instruction (AVX)
prefetcht0 0xe0(%rsi)
vmovups -0x40(%rsi),%ymm2
prefetcht0 0x120(%rsi)
vmovups -0x20(%rsi),%ymm3
vmulpd %ymm0,%ymm1,%ymm4
prefetcht0 0x160(%rsi)
vmulpd %ymm0,%ymm2,%ymm8
vmulpd %ymm0,%ymm3,%ymm12
prefetcht0 0x1a0(%rsi)
vbroadcastsd -0x78(%rdi),%ymm0
vmulpd %ymm0,%ymm1,%ymm5
vmulpd %ymm0,%ymm2,%ymm9
[...]
We can clearly see that the GSL code is not optimized (because of the scalar code and the naive simple loop) and that the OpenBLAS code is optimized, as it uses at least wide SIMD instructions, data prefetching and loop unrolling. Note that the executed OpenBLAS code is not optimal, as it could use the FMA instructions available on my processor.