
Benchmarking matrix multiplication performance: C++ (eigen) is much slower than Python

I am trying to estimate how Python's performance compares to C++.

Here is my Python code:

import numpy as np

a = np.random.rand(1000, 1000)  # dtype is automatically float64
b = np.random.rand(1000, 1000)
c = np.empty((1000, 1000), dtype='float64')

%timeit a.dot(b, out=c)

# 15.5 ms ± 560 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And here is my C++ code, which I compile with Xcode in Release mode:

#include <iostream>
#include <Dense>   // Eigen/Dense, found via the -I include path used below
#include <time.h>

using namespace Eigen;
using namespace std;

int main(int argc, const char * argv[]) {
    // seed the RNG used by MatrixXd::Random
    unsigned int seed = clock();
    srand(seed);

    int Msize = 1000, Nloops = 10;

    MatrixXd m1 = MatrixXd::Random(Msize, Msize);
    MatrixXd m2 = MatrixXd::Random(Msize, Msize);
    MatrixXd m3 = MatrixXd::Random(Msize, Msize);

    cout << "Starting matrix multiplication test with " << Msize
         << " matrices" << endl;

    clock_t start = clock();
    for (int i = 0; i < Nloops; i++)
        m3 = m1 * m2;
    start = clock() - start;

    cout << "time elapsed for 1 multiplication: "
         << start / ((double) CLOCKS_PER_SEC * (double) Nloops)
         << " seconds" << endl;
    return 0;
}

And the result is:

Starting matrix multiplication test with 1000 matrices
time elapsed for 1 multiplication: 0.148856 seconds
Program ended with exit code: 0

This means the C++ program is about 10 times slower.

Alternatively, I tried to compile the cpp code from the macOS terminal:

g++ -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -o my_exec -O3

./my_exec

Starting matrix multiplication test with 1000 matrices
time elapsed for 1 multiplication: 0.150432 seconds

I am aware of a very similar question; however, it looks like the issue there was in the matrix definitions. In my example I used default Eigen functions to create matrices from a uniform distribution.

Thanks, Mikhail

Edit: I found out that while numpy uses multithreading, Eigen does not use multiple threads by default (checked with the Eigen::nbThreads() function). As suggested, I used the -march=native option, which reduced computation time by a factor of 3. Taking into account the 8 threads available on my Mac, I can believe that with multithreading numpy runs 3 times faster.

asked Aug 02 '18 by Mikhail Genkin




1 Answer

After long and painful installations and compilations I've performed benchmarks in Matlab, C++ and Python.

My computer: macOS High Sierra 10.13.6 with an Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz (4 cores, 8 threads). I have a Radeon Pro 560 4096 MB, but no GPU was involved in these tests (I never configured OpenCL, and it does not appear in np.show_config()).

Software: Matlab 2018a, Python 3.6, C++ compilers: Apple LLVM version 9.1.0 (clang-902.0.39.2), g++-8 (Homebrew GCC 8.2.0) 8.2.0

1) Matlab performance: time = (14.3 ± 0.7) ms with 10 runs performed

a=rand(1000,1000);
b=rand(1000,1000);
c=rand(1000,1000);
tic
for i=1:100
    c=a*b;
end
toc/100

2) Python performance (%timeit a.dot(b,out=c)): 15.5 ± 0.8 ms

I also installed the MKL libraries for Python. With numpy linked against MKL: 14.4 ± 0.7 ms. It helps, but only a little.

3) C++ performance. The following changes were applied to the original code (see the question):

  • noalias() to avoid unnecessary temporary matrix creation (m3.noalias() = m1 * m2;).

  • Time was measured with the C++11 chrono library instead of clock().

Here I used a bunch of different options and two different compilers:

3.1 clang++ -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3

Execution time ~ 146 ms

3.2 Added -march=native option:

Execution time ~ 46 +-2 ms

3.3 Changed compiler to GNU g++ (on my Mac it is invoked as gpp via a custom alias):

gpp -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3

Execution time 222 ms

3.4 Added -march=native option:

Execution time ~ 45.5 ± 1 ms

At this point I realized that Eigen does not use multiple threads. I installed OpenMP and added the -fopenmp flag. Note that with the latest Apple clang OpenMP does not work, so I had to use g++ from then on. I also made sure I was actually using all available threads by monitoring the value of Eigen::nbThreads() and by watching the macOS Activity Monitor.

3.5  gpp -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3 -march=native -fopenmp

Execution time: 16.5 ± 0.7 ms

3.6 Finally, I installed the Intel MKL libraries. In the code they are quite easy to use: I just added the #define EIGEN_USE_MKL_ALL macro and that was it. Linking all the libraries was hard, though:

gpp -std=c++11 -DMKL_LP64 -m64 -I${MKLROOT}/include -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen -L${MKLROOT}/lib -Wl,-rpath,${MKLROOT}/lib -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl   main.cpp -o my_exec_intel -O3 -fopenmp  -march=native

Execution time: 14.33 ± 0.26 ms. (Editor's note: this answer originally claimed to have used -DMKL_ILP64, which is not supported. Maybe it used to be, or happened to work.)

Conclusion:

  • Matrix-matrix multiplication in Python/Matlab is highly optimized. It is not possible (or at least very hard) to do significantly better (on a CPU).

  • C++ code (at least on macOS) only achieves similar performance when fully optimized, which includes the full set of compiler options and the Intel MKL libraries. I could have installed an old clang with OpenMP support, but since the single-threaded performance is similar (~46 ms), it looks like this would not help.

  • It would be great to try the native Intel compiler icc. Unfortunately, it is proprietary software, unlike the Intel MKL libraries.

Thanks for useful discussion,

Mikhail

Edit: For comparison, I also benchmarked my GTX 980 GPU using the cublasDgemm function. Computation time = 12.6 ms, which is comparable to the other results. The reason CUDA is only about as good as the CPU here is that my GPU is poorly optimized for doubles. With floats, the GPU time is 0.43 ms, while Matlab's is 7.2 ms.

Edit 2: To gain significant GPU acceleration, I would need to benchmark much bigger matrices, e.g. 10k x 10k.

Edit 3: changed the interface from MKL_ILP64 to MKL_LP64 since ILP64 is not supported.

answered Sep 29 '22 by Mikhail Genkin