
How to write a matrix matrix product that can compete with Eigen?

Below is a C++ program comparing the time taken by Eigen and by a hand-written for loop to perform a matrix-matrix product. The for loop has been optimised to minimise cache misses. It is faster than Eigen for small sizes but eventually becomes slower (up to a factor of 2 for 500-by-500 matrices). What else should I do to compete with Eigen? Is blocking the reason for Eigen's better performance? If so, how should I go about adding blocking to the for loop?

#include<iostream>
#include<Eigen/Dense>
#include<ctime>
#include<cstdlib>   //  for atoi, srand
#include<vector>

int main(int argc, char* argv[]) {
    srand(time(NULL));
    //  Input the size of the matrix from the user
    int N   =   atoi(argv[1]);

    int M   =   N*N;

    //  The matrices stored as row-major vectors (heap-allocated: stack
    //  arrays of runtime size are non-standard C++ and overflow for large N)
    std::vector<double> a(M);
    std::vector<double> b(M);
    std::vector<double> c(M);

    //  Initializing Eigen Matrices
    Eigen::MatrixXd a_E =   Eigen::MatrixXd::Random(N,N);
    Eigen::MatrixXd b_E =   Eigen::MatrixXd::Random(N,N);
    Eigen::MatrixXd c_E(N,N);

    double CPS  =   CLOCKS_PER_SEC;

    clock_t start, end;


    //  Matrix-matrix product by Eigen
    start   =   clock();
    c_E     =   a_E*b_E;
    end     =   clock();

    std::cout << "\nTime taken by Eigen is: " << (end-start)/CPS << "\n";

    //  Initializing For loop vectors
    int count   =   0;
    for (int j=0; j<N; ++j) {
        for (int k=0; k<N; ++k) {
            a[count]    =   a_E(j,k);
            b[count]    =   b_E(j,k);
            ++count;
        }
    }

    //  Matrix-matrix product by for loop
    start   =   clock();
    count   =   0;
    int count1, count2;
    for (int j=0; j<N; ++j) {
        count1  =   j*N;
        for (int k=0; k<N; ++k) {
            c[count]    =   a[count1]*b[k];
            ++count;
        }
    }

    for (int j=0; j<N; ++j) {
        count2  =   N;
        for (int l=1; l<N; ++l) {
            count   =   j*N;
            count1  =   count+l;
            for (int k=0; k<N; ++k) {
                c[count]+=a[count1]*b[count2];
                ++count;
                ++count2;
            }
        }
    }
    end     =   clock();

    std::cout << "\nTime taken by for-loop is: " << (end-start)/CPS << "\n";
}
Adhvaitha asked Feb 25 '16


1 Answer

Your code is already well vectorized by the compiler. The key to higher performance is hierarchical blocking to optimize the use of registers and of the different levels of cache. Partial loop unrolling is also crucial to improve instruction pipelining. Reaching the performance of Eigen's product requires a lot of effort and tuning.

It should also be noted that your benchmark is slightly biased and not fully reliable. Here is a more reliable version (you need Eigen's complete sources to get bench/BenchTimer.h):

#include<iostream>
#include<Eigen/Dense>
#include<bench/BenchTimer.h>
#include<cstdlib>   //  for atoi

// Your row-major product, unchanged: a first pass writes the outer product
// c = a(:,0)*b(0,:), then a second pass accumulates the remaining N-1
// rank-1 updates c(j,:) += a(j,l)*b(l,:).
void myprod(double *c, const double* a, const double* b, int N) {
  int count = 0;
  int count1, count2;
  for (int j=0; j<N; ++j) {
      count1  =   j*N;
      for (int k=0; k<N; ++k) {
          c[count]    =   a[count1]*b[k];
          ++count;
      }
  }

  for (int j=0; j<N; ++j) {
      count2  =   N;
      for (int l=1; l<N; ++l) {
          count   =   j*N;
          count1  =   count+l;
          for (int k=0; k<N; ++k) {
              c[count]+=a[count1]*b[count2];
              ++count;
              ++count2;
          }
      }
  }
}

int main(int argc, char* argv[]) {
  int N = atoi(argv[1]);
  int tries = 4;
  int rep = std::max<int>(1,10000000/N/N/N);

  Eigen::MatrixXd a_E = Eigen::MatrixXd::Random(N,N);
  Eigen::MatrixXd b_E = Eigen::MatrixXd::Random(N,N);
  Eigen::MatrixXd c_E(N,N);

  Eigen::BenchTimer t1, t2;

  BENCH(t1, tries, rep, c_E.noalias() = a_E*b_E );
  BENCH(t2, tries, rep, myprod(c_E.data(), a_E.data(), b_E.data(), N));

  std::cout << "\nTime taken by Eigen is: " << t1.best() << "\n";
  std::cout << "\nTime taken by for-loop is: " << t2.best() << "\n";
}

Compiling with Eigen 3.3-beta1 and FMA enabled (-mfma), the gap becomes much larger: almost one order of magnitude for N=2000.
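For reference, a compile line along these lines should reproduce this (the `eigen` directory and the source file name are assumptions; adjust to your checkout of Eigen's source tree, whose root contains both `Eigen/` and `bench/`):

```shell
# -O3 for vectorization, -mfma to enable fused multiply-add
g++ -O3 -mfma -I eigen bench_prod.cpp -o bench_prod
./bench_prod 2000
```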

ggael answered Oct 12 '22
