Python numpy code more efficient than eigen3 or plain C++

I had some Python 3 code (using NumPy) that I wanted to convert to C++ (using Eigen 3) in order to get a more efficient program. So I decided to test a simple example to assess the performance gain I would get. The code consists of two random arrays that are multiplied coefficient-wise. My conclusion was that the Python code with NumPy is about 30% faster than the C++ one. I'd like to know why the interpreted Python code is faster than compiled C++ code. Am I missing something in the C++ code?

I'm using gcc 9.1.0, Eigen 3.3.7, Python 3.7.3 and Numpy 1.16.4.

Possible explanations:

C++ program isn't using vectorization
Numpy is a lot more optimized than I thought
Time is measuring different things in each program

There is a similar question on Stack Overflow (Eigen Matrix vs Numpy Array multiplication performance). I tested it on my computer and got the expected result that Eigen is more efficient than NumPy, but the operation there is matrix multiplication rather than coefficient-wise multiplication.
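To make the distinction concrete, here is a minimal sketch (sizes reduced for speed) contrasting the two operations: the coefficient-wise product does one multiplication per element and is memory-bound, while the matrix product does on the order of n³ multiply-adds and is compute-bound, typically dispatched to a BLAS library.

```python
import time
import numpy as np

n = 1024
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.perf_counter()
c = a * b            # coefficient-wise: n*n multiplications, memory-bound
t1 = time.perf_counter()
d = a @ b            # matrix product: ~n^3 multiply-adds, handled by BLAS
t2 = time.perf_counter()

print(f"elementwise: {t1 - t0:.4f}s, matmul: {t2 - t1:.4f}s")
```

Because the two operations stress completely different parts of the stack, a benchmark of one says little about the other.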

Python code (main.py)
Execution command: python3 main.py

import numpy as np
import time

Lx = 4096
Ly = 4000

# Filling arrays
a = np.random.rand(Lx, Ly).astype(np.float64)
a1 = np.random.rand(Lx, Ly).astype(np.float64)

# Coefficient-wise product
start = time.time()
b = a*a1

# Compute the elapsed time
end = time.time()

print(b.sum())
print("duration: ", end-start)

C++ code with eigen3 (main_eigen.cpp)
Compilation command: g++ -O3 -I/usr/include/eigen3/ main_eigen.cpp -o prog_eigen

#include <iostream>
#include <chrono>
#include "Eigen/Dense"

#define Lx 4096
#define Ly 4000
typedef double T;

int main(){

    // Allocating arrays
    Eigen::Array<T, -1, -1> KPM_ghosts(Lx, Ly), KPM_ghosts1(Lx, Ly), b(Lx,Ly);

    // Filling the arrays
    KPM_ghosts.setRandom();
    KPM_ghosts1.setRandom();

    // Coefficient-wise product
    auto start = std::chrono::system_clock::now();
    b = KPM_ghosts*KPM_ghosts1;

    // Compute the elapsed time
    auto end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";

    // Print the sum so the compiler doesn't optimize the code away
    std::cout << b.sum() << "\n";

    return 0;
}

Plain C++ code (main.cpp)
Compilation command: g++ -O3 main.cpp -o prog

#include <iostream>
#include <chrono>
#include <cstdlib>   // std::rand, RAND_MAX
#include <vector>

#define Lx 4096
#define Ly 4000
#define N (Lx*Ly)
typedef double T;

int main(){
    // Allocating arrays on the heap: three stack arrays of N doubles
    // (~131 MB each) would overflow the stack
    std::vector<T> lin_vector1(N);
    std::vector<T> lin_vector2(N);
    std::vector<T> lin_vector3(N);

    // Filling the arrays
    for(unsigned i = 0; i < N; i++){
        lin_vector1[i] = std::rand()*1.0/RAND_MAX;
        lin_vector2[i] = std::rand()*1.0/RAND_MAX;
    }

    // Coefficient-wise product
    auto start = std::chrono::system_clock::now();
    for(unsigned i = 0; i < N; i++)
        lin_vector3[i] = lin_vector1[i]*lin_vector2[i];

    // Compute the elapsed time
    auto end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";

    // Print the sum so the compiler doesn't optimize the code away
    double sum = 0;
    for(unsigned i = 0; i < N; i++)
        sum += lin_vector3[i];
    std::cout << "sum: " << sum << "\n";


    return 0;
}

Runtimes of each program over 10 runs

Plain C++
elapsed time: 0.210664s
elapsed time: 0.215406s
elapsed time: 0.222483s
elapsed time: 0.21526s
elapsed time: 0.216346s
elapsed time: 0.218951s
elapsed time: 0.21587s
elapsed time: 0.213639s
elapsed time: 0.219399s
elapsed time: 0.213403s

C++ with Eigen 3
elapsed time: 0.21052s
elapsed time: 0.220779s
elapsed time: 0.216269s
elapsed time: 0.229234s
elapsed time: 0.212265s
elapsed time: 0.256714s
elapsed time: 0.212396s
elapsed time: 0.248241s
elapsed time: 0.241537s
elapsed time: 0.323519s

Python
duration: 0.23946428298950195
duration: 0.1663036346435547
duration: 0.17225909233093262
duration: 0.15922021865844727
duration: 0.16628384590148926
duration: 0.15654635429382324
duration: 0.15859222412109375
duration: 0.1633443832397461
duration: 0.1685199737548828
duration: 0.16393446922302246

Asked by Sermal, May 02 '26
1 Answer

I would like to add a couple of hypotheses to the comments above.

One is that NumPy is multithreading. Your C++ is compiled with -O3, which usually already gives a good speedup. I assume the default PyPI NumPy packages are not built with -O3 or other aggressive optimizations, yet NumPy is significantly faster. One way for that to happen is if it were slow to begin with but used multiple CPU cores.

One way to check is to force it to use only one thread by setting the environment variables mentioned here:

OMP_NUM_THREADS=1 MPI_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1
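As a sketch, you can also pin these from inside the script, as long as the variables are set before NumPy is first imported (the thread pools are sized at import time). If the single-thread timing matches the C++ numbers, multithreading explains the gap:

```python
import os
# Must happen before the first numpy import
for var in ("OMP_NUM_THREADS", "MPI_NUM_THREADS",
            "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "1"

import time
import numpy as np

# Same benchmark as in the question, now limited to one thread
a = np.random.rand(4096, 4000)
a1 = np.random.rand(4096, 4000)

start = time.time()
b = a * a1
end = time.time()

print(b.sum())
print("duration:", end - start)
```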

Alternatively, or in addition, it could be due to an optimized build such as the MKL build you can install from Anaconda. As the comments above suggest, you could also see how much using SSE or AVX in the C++ code improves its performance, using a compiler flag such as -march=native.
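A quick way to check which accelerated backend your NumPy build links against (MKL, OpenBLAS, etc.) is the snippet below. Note, as a caveat, that the linked BLAS mainly affects operations like matrix multiplication; a plain coefficient-wise product goes through NumPy's own ufunc loops, which may or may not be SIMD-vectorized depending on the build.

```python
import numpy as np

# Prints the BLAS/LAPACK libraries this numpy build was compiled against
np.show_config()
print(np.__version__)
```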

Answered by SpaceCadetPinballer, May 05 '26

