Eigen + MKL slower than Matlab for matrix multiplication

I am doing a lot of matrix multiplications in a C++ program and I use Eigen (3.3.5) linked with Intel's MKL (2018.3.222). I use the sequential version of the MKL and OpenMP is disabled. The problem is that it is slower than Matlab.

Some example code:

#define NDEBUG              // disable Eigen's runtime assertions
#define EIGEN_USE_MKL_ALL   // let Eigen forward supported operations to MKL

#include <iostream>
#include <chrono>
#include <Eigen/Core>

using namespace Eigen;
using namespace std;

int main(){
    // 12280 x 2850 matrix with entries in [-100, 100]
    MatrixXd jac = 100*MatrixXd::Random(10*1228, 2850);
    MatrixXd res = MatrixXd::Zero(2850, 2850);

    for (int i=0; i<10; i++){
        auto begin = chrono::high_resolution_clock::now();
        // general matrix product jac^T * jac, written straight into res
        res.noalias() = jac.transpose()*jac;
        auto end = chrono::high_resolution_clock::now();

        // elapsed time in milliseconds
        cout<<"time: "<<chrono::duration_cast<chrono::milliseconds>(end-begin).count() <<endl;
    }

    return 0;
}

It reports about 8 seconds on average. Compiled with -O3 and no debug symbols on Ubuntu 16.04 with g++ 6.4.

The Matlab code:

m=100*(-1+2*rand(10*1228, 2850));
res = zeros(2850, 2850);
tic; res=m'*m; toc

It reports ~4 seconds, which is two times faster. I used Matlab R2017a on the same system with maxNumCompThreads(1). Matlab uses MKL 11.3.

Without MKL and using only Eigen, it takes about 18s. What can I do to bring the C++ running time down to the same value as Matlab's? Thank you.

Later Edit: As @Qubit suggested, Matlab recognises that I am trying to multiply a matrix with its transpose and does some 'hidden' optimization. When I multiplied two different matrices in Matlab, the time went up to those 8 seconds. So, now the problem becomes: how can I tell Eigen that this matrix product is 'special' and could be optimized further?

Later Edit 2: I tried doing it like this:

MatrixXd jac = 100*MatrixXd::Random(10*1228, 2850);
MatrixXd res = MatrixXd::Zero(2850, 2850);

auto begin = chrono::high_resolution_clock::now();
// symmetric rank update: lower triangle of res += 1 * jac^T * jac
res.selfadjointView<Lower>().rankUpdate(jac.transpose(), 1);
// mirror the lower triangle into the upper one to get the full matrix
res.triangularView<Upper>() = res.transpose();
auto end = chrono::high_resolution_clock::now();

// reference result from the general product, for checking only (not timed)
MatrixXd oldSchool = jac.transpose()*jac;
if (oldSchool.isApprox(res)){
    cout<<"same result!"<<endl;
}
cout<<"time: "<<chrono::duration_cast<chrono::milliseconds>(end-begin).count() <<endl;

but now it takes 9.4 seconds, which is half the time plain Eigen (without MKL) needs for the general product. Disabling MKL has no effect on this timing, so I believe the 'rankUpdate' method does not use MKL?

Last EDIT: I have found a bug in an Eigen header file:

Core/products/GeneralMatrixMatrixTriangular_BLAS.h

at line 55. There was a misplaced parenthesis. I changed this:

if ( lhs==rhs && ((UpLo&(Lower|Upper)==UpLo)) ) { \

to this:

if ( lhs==rhs && ((UpLo&(Lower|Upper))==UpLo) ) { \

Now, my C++ version and Matlab have the same execution speed (of ~4 seconds on my system).

asked Aug 16 '18 09:08

Costin Florian Ciusdel

1 Answer

Not really an answer, since you already figured out the issues, but some comments:

The issue in Core/products/GeneralMatrixMatrixTriangular_BLAS.h was already fixed in the devel branch, but it turns out the fix was never backported to the 3.3 branch.
The issue is now fixed in the 3.3 branch. The fix will be part of 3.3.6.
A 2x speedup between built-in Eigen and MKL in single-threaded mode does not make sense. Make sure to enable all the features your CPU supports by compiling with -march=native in addition to -O3 -DNDEBUG. On my Haswell 2.6GHz I get 3.4s vs 3s.

answered Oct 05 '22 21:10

ggael

Tags: c++, intel, matlab, eigen, intel-mkl
