Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Eigen + MKL slower than Matlab for matrix multiplication

I am doing a lot of matrix multiplications in a C++ program and I use Eigen (3.3.5) linked with Intel's MKL (2018.3.222). I use the sequential version of the MKL and OpenMP is disabled. The problem is that it is slower than Matlab.

Some example code:

#define NDEBUG
#define EIGEN_USE_MKL_ALL

#include <iostream>
#include <chrono>
#include <Core>

using namespace Eigen;
using namespace std;

int main(){
    MatrixXd jac = 100*MatrixXd::Random(10*1228, 2850);
    MatrixXd res = MatrixXd::Zero(2850, 2850);

    for (int i=0; i<10; i++){
        auto begin = chrono::high_resolution_clock::now();
        res.noalias() = jac.transpose()*jac;
        auto end = chrono::high_resolution_clock::now();

        cout<<"time: "<<chrono::duration_cast<chrono::milliseconds>(end-begin).count() <<endl;
    }

    return 0;
}

It reports about 8 seconds on average. Compiled with -O3 and no debug symbols on Ubuntu 16.04 with g++ 6.4.

The Matlab code:

m=100*(-1+2*rand(10*1228, 2850));
res = zeros(2850, 2850);
tic; res=m'*m; toc

It reports ~4 seconds, which is two times faster. I used Matlab R2017a on the same system with maxNumCompThreads(1). Matlab uses MKL 11.3.

Without MKL and using only Eigen, it takes about 18s. What can I do to bring the C++ running time down to the same value as Matlab's? Thank you.

Later Edit: As @Qubit suggested, Matlab recognises that I am trying to multiply a matrix with its transpose and does some 'hidden' optimization. When I multiplied two different matrices in Matlab, the time went up to those 8 seconds. So, now the problem becomes: how can I tell Eigen that this matrix product is 'special' and could be optimized further?

Later Edit 2: I tried doing it like this:

MatrixXd jac = 100*MatrixXd::Random(10*1228, 2850);
MatrixXd res = MatrixXd::Zero(2850, 2850);

auto begin = chrono::high_resolution_clock::now();
res.selfadjointView<Lower>().rankUpdate(jac.transpose(), 1);
res.triangularView<Upper>() = res.transpose();
auto end = chrono::high_resolution_clock::now();

MatrixXd oldSchool = jac.transpose()*jac;
if (oldSchool.isApprox(res)){
    cout<<"same result!"<<endl;
}
cout<<"time: "<<chrono::duration_cast<chrono::milliseconds>(end-begin).count() <<endl;

but now it takes 9.4 seconds (which is half of the time Eigen with no MKL requires for the classic product). Disabling the MKL has no time effect on this timing, therefore I believe the 'rankUpdate' method does not use MKL ?!?

Last EDIT: I have found a bug in eigen header file:

Core/products/GeneralMatrixMatrixTriangular_BLAS.h

at line 55. There was a misplaced parenthesis. I changed this:

if ( lhs==rhs && ((UpLo&(Lower|Upper)==UpLo)) ) { \

to this:

if ( lhs==rhs && ((UpLo&(Lower|Upper))==UpLo) ) { \

Now, my C++ version and Matlab have the same execution speed (of ~4 seconds on my system).

like image 510
Costin Florian Ciusdel Avatar asked Aug 16 '18 09:08

Costin Florian Ciusdel


People also ask

Is Eigen faster than Blas?

Eigen is faster than every Free BLAS, such as ATLAS or Boost::uBlas.

Why Matlab Matrix is fast?

The general answer to "Why is matlab faster at doing xxx than other programs" is that matlab has a lot of built in, optimized functions. The other programs that are used often do not have these functions so people apply their own creative solutions, which are suprisingly slower than professionally optimized code.

Does Eigen use MKL?

Eigen: Using Intel® MKL from Eigen. Since Eigen version 3.1 and later, users can benefit from built-in Intel® Math Kernel Library (MKL) optimizations with an installed copy of Intel MKL 10.3 (or later). Intel MKL provides highly optimized multi-threaded mathematical routines for x86-compatible architectures.


1 Answers

To really an answer since you already figured out the issues, but some comments:

  1. The issue Core/products/GeneralMatrixMatrixTriangular_BLAS.h was already fixed in the devel branch, but it turns out it has never been brackported to the 3.3 branch.

  2. The issue is now fixed in the 3.3 branch. The fix will be part of 3.3.6.

  3. A speedup factor x2 between built-in Eigen and MKL in single thread mode does not make sense. Make sure to enable all features your CPU support by compiling with -march=native in addition to -O3 -DNDEBUG. On my Haswell 2.6GHz I get 3.4s vs 3s.

like image 100
ggael Avatar answered Oct 05 '22 21:10

ggael