I am trying to estimate how good Python's performance is compared to C++.
Here is my Python code:
import numpy as np

a = np.random.rand(1000, 1000)   # dtype is automatically float64
b = np.random.rand(1000, 1000)
c = np.empty((1000, 1000), dtype='float64')
%timeit a.dot(b, out=c)
# 15.5 ms ± 560 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And here is my C++ code, which I compile with Xcode in Release mode:
#include <iostream>
#include <Dense>
#include <time.h>
using namespace Eigen;
using namespace std;

int main(int argc, const char * argv[]) {
    // Seed the RNG used by MatrixXd::Random
    unsigned int seed = clock();
    srand(seed);

    int Msize = 1000, Nloops = 10;
    MatrixXd m1 = MatrixXd::Random(Msize, Msize);
    MatrixXd m2 = MatrixXd::Random(Msize, Msize);
    MatrixXd m3 = MatrixXd::Random(Msize, Msize);

    cout << "Starting matrix multiplication test with " << Msize
         << " matrices" << endl;

    clock_t start = clock();
    for (int i = 0; i < Nloops; i++)
        m3 = m1 * m2;
    start = clock() - start;

    cout << "time elapsed for 1 multiplication: "
         << start / ((double) CLOCKS_PER_SEC * (double) Nloops)
         << " seconds" << endl;
    return 0;
}
And the result is:
Starting matrix multiplication test with 1000 matrices
time elapsed for 1 multiplication: 0.148856 seconds
Program ended with exit code: 0
This means the C++ program is about 10 times slower.
Alternatively, I've tried compiling the C++ code from the macOS terminal:
g++ -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -o my_exec -O3
./my_exec
Starting matrix multiplication test with 1000 matrices
time elapsed for 1 multiplication: 0.150432 seconds
I am aware of a very similar question; however, it looks like the issue there was in the matrix definitions. In my example I've used the default Eigen functions to create matrices from a uniform distribution.
Thanks, Mikhail
Edit: I found out that while NumPy uses multithreading, Eigen does not use multiple threads by default (checked with the Eigen::nbThreads() function; see the sketch below).
As suggested, I used the -march=native option, which reduced computation time by a factor of 3. Taking into account the 8 threads available on my Mac, I can believe that with multithreading NumPy runs 3 times faster.
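For reference, here is a minimal sketch (my addition, not part of the original post) of how one can query the number of threads Eigen uses; Eigen::nbThreads() and Eigen::setNbThreads() belong to Eigen's parallelization API and only report more than one thread when OpenMP is enabled:

#include <iostream>
#include <Eigen/Dense>

int main() {
    // Reports 1 unless Eigen was compiled with OpenMP support (-fopenmp)
    std::cout << "Eigen uses " << Eigen::nbThreads() << " thread(s)" << std::endl;
    // Optionally cap the number of threads Eigen may use:
    // Eigen::setNbThreads(4);
    return 0;
}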
After long and painful installations and compilations, I've performed benchmarks in Matlab, C++, and Python.
My computer: macOS High Sierra 10.13.6 with an Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz (4 cores, 8 threads). I have a Radeon Pro 560 4096 MB, but no GPU was involved in these tests (I never configured OpenCL, and it does not appear in np.show_config()).
Software: Matlab 2018a, Python 3.6; C++ compilers: Apple LLVM version 9.1.0 (clang-902.0.39.2) and g++-8 (Homebrew GCC 8.2.0).
1) Matlab performance: time = (14.3 ± 0.7) ms, averaged over 10 runs of the timed loop below
a = rand(1000,1000);
b = rand(1000,1000);
c = rand(1000,1000);
tic
for i = 1:100
    c = a*b;
end
toc/100
2) Python performance (%timeit a.dot(b,out=c)): (15.5 ± 0.8) ms
I've also installed the MKL libraries for Python. With NumPy linked against MKL: (14.4 ± 0.7) ms. It helps, but only a little.
3) C++ performance. The following changes were applied to the original code (see the question); a sketch of the resulting benchmark loop follows the list:
- the noalias function is used to avoid creating an unnecessary temporary matrix;
- time is measured with the C++11 chrono library.
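A minimal sketch of what the modified benchmark loop might look like (my reconstruction under those two changes, not the author's exact code):

#include <iostream>
#include <chrono>
#include <Eigen/Dense>
using namespace Eigen;

int main() {
    const int Msize = 1000, Nloops = 10;
    MatrixXd m1 = MatrixXd::Random(Msize, Msize);
    MatrixXd m2 = MatrixXd::Random(Msize, Msize);
    MatrixXd m3 = MatrixXd::Random(Msize, Msize);

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < Nloops; i++)
        m3.noalias() = m1 * m2;   // noalias() skips the temporary for the product
    auto stop = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double> elapsed = stop - start;
    std::cout << "time per multiplication: "
              << elapsed.count() / Nloops * 1e3 << " ms" << std::endl;
    return 0;
}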
Here I used a bunch of different options and two different compilers:
3.1 clang++ -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3
Execution time ~ 146 ms
3.2 Added -march=native option:
Execution time ~ 46 +-2 ms
3.3 Changed compiler to GNU g++ (on my Mac it is invoked as gpp via a custom alias):
gpp -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3
Execution time 222 ms
3.4 Added the -march=native option:
Execution time ~ 45.5 +- 1 ms
At this point I realized that Eigen does not use multiple threads. I installed OpenMP and added the -fopenmp flag. Note that with the latest clang version OpenMP does not work, so I had to use g++ from then on. I also made sure I was actually using all available threads, both by monitoring the value of Eigen::nbThreads() and with the macOS Activity Monitor.
3.5 gpp -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3 -march=native -fopenmp
Execution time: 16.5 +- 0.7 ms
3.6 Finally, I installed the Intel MKL libraries. Using them in the code is quite easy: I just added the #define EIGEN_USE_MKL_ALL macro (see the sketch below) and that's it. Linking all the libraries was hard, though:
gpp -std=c++11 -DMKL_LP64 -m64 -I${MKLROOT}/include -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen -L${MKLROOT}/lib -Wl,-rpath,${MKLROOT}/lib -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl main.cpp -o my_exec_intel -O3 -fopenmp -march=native
Execution time: (14.33 ± 0.26) ms. (Editor's note: this answer originally claimed to use -DMKL_ILP64, which is not supported. Maybe it used to be, or it happened to work.)
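For clarity, a short sketch of where the macro goes: Eigen requires EIGEN_USE_MKL_ALL to be defined before any Eigen header is included, so that dense matrix products are forwarded to MKL's BLAS/LAPACK routines (the snippet itself is my illustration):

// Must be defined before including any Eigen header, so Eigen
// routes its dense matrix products through MKL's BLAS/LAPACK.
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>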
Conclusion:
Matrix-matrix multiplication in Python/Matlab is highly optimized. It is not possible (or at least very hard) to do significantly better (on a CPU).
C++ code (at least on the Mac platform) can only achieve similar performance if fully optimized, which includes the full set of optimization options and the Intel MKL libraries. I could have installed an old clang compiler with OpenMP support, but since the single-thread performance is similar (~46 ms), it looks like this would not help.
It would be great to try it with the native Intel compiler icc. Unfortunately, unlike the Intel MKL libraries, it is proprietary software.
Thanks for useful discussion,
Mikhail
Edit: For comparison, I've also benchmarked my GTX 980 GPU using the cublasDgemm function. Computation time = 12.6 ms, which is comparable with the other results. The reason CUDA is only about as fast as the CPU is that my GPU is poorly optimized for doubles: with floats, the GPU time is 0.43 ms, while Matlab's is 7.2 ms.
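A rough sketch of such a cuBLAS benchmark (my own illustration, not the author's code; error checking is omitted, and timing is done here with CUDA events):

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1000;
    const double alpha = 1.0, beta = 0.0;
    std::vector<double> h(n * n, 1.0);          // host data (arbitrary values)

    double *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(double));
    cudaMalloc(&dB, n * n * sizeof(double));
    cudaMalloc(&dC, n * n * sizeof(double));
    cudaMemcpy(dA, h.data(), n * n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, h.data(), n * n * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    // C = alpha * A * B + beta * C, all matrices n x n, column-major
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("cublasDgemm time: %.3f ms\n", ms);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}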
Edit 2: To gain significant GPU acceleration, I would need to benchmark matrices of much bigger size, e.g. 10k x 10k.
Edit 3: Changed the interface from MKL_ILP64 to MKL_LP64, since ILP64 is not supported.