Preamble
Some time ago I asked a question about the performance of Matlab vs Python (Performance: Matlab vs Python). I was surprised that Matlab is faster than Python, especially in meshgrid. In the discussion of that question, it was pointed out to me that I should use a wrapper in Python to call my C++ code, since the C++ code is also available to me; I have the same code in C++, Matlab and Python.
While doing that, I was surprised once again to find that Matlab is faster than C++ in matrix assembly and computation. I have a slightly larger code, from which I am investigating a segment of matrix-vector multiplication; the larger code performs such multiplications at multiple places. Overall, the C++ code is much faster than Matlab (because function calls in Matlab have an overhead, etc.), but Matlab seems to outperform C++ in the matrix-vector multiplication itself (code snippets at the bottom).
Results
The table below compares the time it takes to assemble the kernel matrix (the NxN matrix of pairwise Euclidean distances between the input points) and the time it takes to multiply that matrix with a vector, for N varying from 10,000 to 40,000 (all times in seconds). That is not particularly large, but interestingly, the larger N gets, the more Matlab outperforms C++: it is 3.8 to 5.8 times faster in total time, and it is faster in both the matrix assembly and the computation.
N = 10,000            Assembly    Computation    Total
MATLAB                0.3387      0.031          0.3697
C++                   1.15        0.24           1.4
Times faster (total)                             3.8

N = 20,000            Assembly    Computation    Total
MATLAB                1.089       0.0977         1.187
C++                   5.1         1.03           6.13
Times faster (total)                             5.2

N = 40,000            Assembly    Computation    Total
MATLAB                4.31        0.348          4.655
C++                   23.25       3.91           27.16
Times faster (total)                             5.8
Question
Is there a faster way of doing this in C++? Am I missing something? I understand that the C++ version uses explicit for loops, but my understanding is that Matlab must also be doing something similar inside meshgrid.
Code Snippets
Matlab Code:
%% GET INPUT DATA FROM DATA FILES ------------------------------------------- %
% Read data from input file
Data = load('Input/input.txt');
location = Data(:,1:2);
charges = Data(:,3:end);
N = length(location);
m = size(charges,2);
%% EXACT MATRIX VECTOR PRODUCT ---------------------------------------------- %
kex1 = ex1;  % instantiate the kernel-assembly class
tic
Q = kex1.kernel_2D(location , location);
fprintf('\n Assembly time: %f ', toc);
tic
potential_exact = Q * charges;
fprintf('\n Computation time: %f \n', toc);
Class (Using meshgrid):
classdef ex1
methods
function [kernel] = kernel_2D(obj, x,y)
[i1,j1] = meshgrid(y(:,1),x(:,1));
[i2,j2] = meshgrid(y(:,2),x(:,2));
kernel = sqrt( (i1 - j1) .^ 2 + (i2 - j2) .^2 );
end
end
end
C++ Code:
EDIT: Compiled using a makefile with the following flags:
CC=g++
CFLAGS=-c -fopenmp -w -Wall -DNDEBUG -O3 -march=native -ffast-math -ffinite-math-only -I header/ -I /usr/include
LDFLAGS= -g -fopenmp
LIB_PATH=
SOURCESTEXT= src/read_Location_Charges.cpp
SOURCESF=examples/matvec.cpp
OBJECTSF= $(SOURCESF:.cpp=.o) $(SOURCESTEXT:.cpp=.o)
EXECUTABLEF=./exec/mykernel
mykernel: $(SOURCESF) $(SOURCESTEXT) $(EXECUTABLEF)
$(EXECUTABLEF): $(OBJECTSF)
$(CC) $(LDFLAGS) $(KERNEL) $(INDEX) $(OBJECTSF) -o $@ $(LIB_PATH)
.cpp.o:
$(CC) $(CFLAGS) $(KERNEL) $(INDEX) $< -o $@
examples/matvec.cpp:
#include "environment.hpp"
using namespace std;
using namespace Eigen;
class ex1
{
public:
// Builds the M x N kernel matrix K(i,j) = ||x_i - y_j|| for 2-D points.
// x and y are column-major: the first M (resp. N) entries hold the first
// coordinate, the next M (resp. N) entries hold the second coordinate.
void kernel_2D(const unsigned long M, double* x, const unsigned long N, double* y, MatrixXd& kernel) {
    kernel = MatrixXd::Zero(M,N);
    for(unsigned long i=0;i<M;++i) {
        for(unsigned long j=0;j<N;++j) {
            double X = x[i]   - y[j];     // first coordinates (stride M for x, N for y)
            double Y = x[M+i] - y[N+j];   // second coordinates
            kernel(i,j) = sqrt((X*X) + (Y*Y));
        }
    }
}
};
int main()
{
/* Input ----------------------------------------------------------------------------- */
unsigned long N = 40000; unsigned m=1;
double* charges; double* location;
charges = new double[N * m](); location = new double[N * 2]();
clock_t start; clock_t end;
double exactAssemblyTime; double exactComputationTime;
read_Location_Charges ("input/test_input.txt", N, location, m, charges);
MatrixXd charges_ = Map<MatrixXd>(charges, N, m);
MatrixXd Q;
ex1 Kex1;
/* Process ------------------------------------------------------------------------ */
// Matrix assembly
start = clock();
Kex1.kernel_2D(N, location, N, location, Q);
end = clock();
exactAssemblyTime = double(end-start)/double(CLOCKS_PER_SEC);
//Computation
start = clock();
MatrixXd QH = Q * charges_;
end = clock();
exactComputationTime = double(end-start)/double(CLOCKS_PER_SEC);
cout << endl << "Assembly time: " << exactAssemblyTime << endl;
cout << endl << "Computation time: " << exactComputationTime << endl;
// Clean up
delete []charges;
delete []location;
return 0;
}
EDIT 2: With the changes suggested in the answer below, performance is comparable for small to medium size systems, while the C++ implementation is up to 2.5 times faster than MATLAB for large systems, which makes sense!
Answers
The general answer to "why is Matlab faster at doing xxx than other programs" is that Matlab has a lot of built-in, optimized functions. Other programs often lack these functions, so people apply their own creative solutions, which are surprisingly slower than professionally optimized code.
As said in the comments, MatLab relies on Intel's MKL library for matrix products, which is the fastest library for this kind of operation. Nonetheless, Eigen alone should be able to deliver similar performance. To this end, make sure to use a recent Eigen (e.g. 3.4) and compilation flags that enable AVX/FMA where available (multithreading comes via OpenMP below):
-O3 -DNDEBUG -march=native
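For example, a full compile command might look like the following (a sketch only; the source files and include paths are taken from the makefile above):
g++ -O3 -DNDEBUG -march=native -fopenmp -I header/ -I /usr/include examples/matvec.cpp src/read_Location_Charges.cpp -o exec/mykernel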
Since charges_ is a vector, better use a VectorXd so that Eigen knows you want a matrix-vector product and not a matrix-matrix one.
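A minimal sketch of that change in main, assuming m == 1 as in the posted driver:
VectorXd charges_ = Map<VectorXd>(charges, N);  // map the raw buffer as a plain vector, no copy
VectorXd QH = Q * charges_;                     // Eigen now dispatches its matrix*vector kernel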
If you have Intel's MKL, then you can also let Eigen use it to get exactly the same performance as MatLab for this precise operation.
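A sketch of how to turn on Eigen's MKL backend (this assumes an MKL installation and linking against the MKL libraries; see Eigen's documentation on using Intel MKL):
#define EIGEN_USE_MKL_ALL   // must be defined before any Eigen header is included
#include <Eigen/Dense>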
Regarding the assembly, better swap the two loops so that the inner work is column-wise (enabling vectorization), then enable multithreading with OpenMP (add -fopenmp to the compiler flags) to make the outermost loop run in parallel, and finally simplify the code using Eigen:
void kernel_2D(const unsigned long M, double* x, const unsigned long N, double* y, MatrixXd& kernel) {
kernel.resize(M,N);
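    // map the raw column-major buffers: the first M (resp. N) entries hold
    // the first coordinate, the next M (resp. N) entries the second one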
auto x0 = ArrayXd::Map(x,M);
auto x1 = ArrayXd::Map(x+M,M);
auto y0 = ArrayXd::Map(y,N);
auto y1 = ArrayXd::Map(y+N,N);
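    // parallelize over columns; within each column Eigen vectorizes the whole expression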
#pragma omp parallel for
for(unsigned long j=0;j<N;++j)
kernel.col(j) = sqrt((x0-y0(j)).abs2() + (x1-y1(j)).abs2());
}
With multi-threading you need to measure the wall-clock time rather than CPU time. Here (Haswell with 4 physical cores running at 2.6 GHz), the assembly time drops to 0.36 s for N = 20,000 and the matrix-vector product takes 0.24 s, so 0.6 s in total, which is faster than MatLab even though my CPU seems to be slower than yours.
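Note that clock(), as used in the posted main, accumulates CPU time across all threads. A minimal sketch of a wall-clock measurement with std::chrono that could replace the clock() calls above:
#include <chrono>
auto t0 = std::chrono::steady_clock::now();
Kex1.kernel_2D(N, location, N, location, Q);   // timed region
auto t1 = std::chrono::steady_clock::now();
double wallSeconds = std::chrono::duration<double>(t1 - t0).count();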