From what I've read about Eigen (here), it seems that `operator=()` acts as a "barrier" of sorts for lazy evaluation -- i.e. it causes Eigen to stop returning expression templates and actually perform the (optimized) computation, storing the result into the left-hand side of the `=`.
This would seem to mean that one's "coding style" has an impact on performance -- i.e. using named variables to store the result of intermediate computations might have a negative effect on performance by causing some portions of the computation to be evaluated "too early".
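As a minimal illustration of that intuition (my own sketch, not from the linked docs): the right-hand side below stays an unevaluated expression template until it is assigned into a concrete `Array`.

```cpp
#include <Eigen/Dense>

int main()
{
    Eigen::ArrayXf a = Eigen::ArrayXf::Random(4);
    Eigen::ArrayXf b = Eigen::ArrayXf::Random(4);

    auto expr = a * b + a;   // expression template: nothing is computed yet

    Eigen::ArrayXf r;
    r = expr;                // operator= evaluates the expression, storing into r
}
```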
To try to verify my intuition, I wrote up an example and was surprised at the results (full code here):
```cpp
using ArrayXf  = Eigen::Array<float, Eigen::Dynamic, Eigen::Dynamic>;
using ArrayXcf = Eigen::Array<std::complex<float>, Eigen::Dynamic, Eigen::Dynamic>;

float test1( const MatrixXcf & mat )
{
    ArrayXcf arr  = mat.array();
    ArrayXcf conj = arr.conjugate();
    ArrayXcf magc = arr * conj;
    ArrayXf  mag  = magc.real();
    return mag.sum();
}

float test2( const MatrixXcf & mat )
{
    return ( mat.array() * mat.array().conjugate() ).real().sum();
}

float test3( const MatrixXcf & mat )
{
    ArrayXcf magc = ( mat.array() * mat.array().conjugate() );
    ArrayXf  mag  = magc.real();
    return mag.sum();
}
```
The above gives 3 different ways of computing the coefficient-wise sum of magnitudes in a complex-valued matrix.
- `test1` sort of takes each portion of the computation "one step at a time."
- `test2` does the whole computation in one expression.
- `test3` takes a "blended" approach -- with some amount of intermediate variables.

I sort of expected that since `test2` packs the entire computation into one expression, Eigen would be able to take advantage of that and globally optimize the entire computation, providing the best performance.
However, the results were surprising (numbers shown are in total microseconds across 1000 executions of each test):
```
test1_us: 154994
test2_us: 365231
test3_us:  36613
```
(This was compiled with g++ -O3 -- see the gist for full details.)
The version I expected to be fastest (`test2`) was actually slowest, and the version I expected to be slowest (`test1`) was actually in the middle.
So, my questions are:

1. Why does `test3` perform so much better than the alternatives?
2. Is there a way to tell, short of trial and error, whether an expression is being evaluated "too early" (e.g. by detecting intermediate allocations)?
3. In more complex computations, doing everything in one expression could hinder readability, so what is the right way to write code that is both readable and performant?
It looks like a problem with GCC; the Intel compiler gives the expected result.
```
$ g++ -I ~/program/include/eigen3 -std=c++11 -O3 a.cpp -o a && ./a
test1_us: 200087
test2_us: 320033
test3_us: 44539

$ icpc -I ~/program/include/eigen3 -std=c++11 -O3 a.cpp -o a && ./a
test1_us: 214537
test2_us: 23022
test3_us: 42099
```
Compared to the `icpc` version, `gcc` seems to have trouble optimizing your `test2`.
For more precise results, you may want to turn off the debug assertions with `-DNDEBUG`, as shown here.
EDIT
For question 1

@ggael gives an excellent answer showing that `gcc` fails to vectorize the sum loop. My experiments also find that `test2` is as fast as a hand-written naive for-loop, both with `gcc` and `icc`, suggesting that vectorization is the reason, and no temporary memory allocation is detected in `test2` by the method mentioned below, suggesting that Eigen evaluates the expression correctly.
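For reference, here is a sketch of the kind of hand-written loop used for that comparison (my own reconstruction; the exact benchmark loop is not shown here):

```cpp
#include <Eigen/Dense>
#include <complex>

// Naive per-coefficient loop computing the same sum of squared magnitudes.
float test_loop( const Eigen::MatrixXcf & mat )
{
    float sum = 0.f;
    const std::complex<float> * p = mat.data();
    for( Eigen::Index i = 0; i < mat.size(); ++i )
        sum += std::norm( p[i] );  // re*re + im*im == (z * conj(z)).real()
    return sum;
}
```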
For question 2

Avoiding intermediate memory is the main purpose of Eigen's expression templates. Eigen therefore provides the macro `EIGEN_RUNTIME_NO_MALLOC` and a simple function to let you check whether any intermediate memory is allocated while an expression is calculated. You can find a sample code here. Please note this may only work in debug mode.
> EIGEN_RUNTIME_NO_MALLOC - if defined, a new switch is introduced which can be turned on and off by calling set_is_malloc_allowed(bool). If malloc is not allowed and Eigen tries to allocate memory dynamically anyway, an assertion failure results. Not defined by default.
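A minimal sketch of how that switch can be used (assuming a debug build, i.e. without `-DNDEBUG`, since the check relies on Eigen's internal assertions):

```cpp
// EIGEN_RUNTIME_NO_MALLOC must be defined before including Eigen.
#define EIGEN_RUNTIME_NO_MALLOC
#include <Eigen/Dense>
#include <iostream>

int main()
{
    Eigen::MatrixXcf mat = Eigen::MatrixXcf::Random(128, 128);

    // Forbid heap allocation while the expression is evaluated; if Eigen
    // needs a temporary here, an assertion failure is triggered.
    Eigen::internal::set_is_malloc_allowed(false);
    float s = ( mat.array() * mat.array().conjugate() ).real().sum();
    Eigen::internal::set_is_malloc_allowed(true);

    std::cout << s << "\n";  // reaching this line means no temporary was allocated
}
```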
For question 3

There is a way to use intermediate variables and still get the performance improvement of lazy evaluation/expression templates. The way is to use intermediate variables with the correct data type. Instead of `Eigen::Matrix`/`Array`, which forces the expression to be evaluated, you should use the expression types `Eigen::MatrixBase`/`ArrayBase`/`DenseBase`, so that the expression is only buffered, not evaluated. This means you should store the expression itself as the intermediate, rather than the result of the expression, with the condition that the intermediate is used only once in the following code.
As determining the template parameters of the expression types `Eigen::MatrixBase/...` can be painful, you can use `auto` instead. You can find some hints on when you should/should not use `auto`/expression types on this page. Another page also tells you how to pass expressions as function parameters without evaluating them.
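As a concrete sketch of these two ideas (the names `test3_lazy` and `sum_sq_mag` are mine, chosen for illustration):

```cpp
#include <Eigen/Dense>

using Eigen::MatrixXcf;

// test3 rewritten so the intermediates stay unevaluated expressions:
// the whole computation fuses into a single pass when .sum() runs.
float test3_lazy( const MatrixXcf & mat )
{
    auto magc = mat.array() * mat.array().conjugate(); // expression, not a concrete Array
    auto mag  = magc.real();                           // still an expression
    return mag.sum();                                  // one fused evaluation
}

// Passing an expression into a function without evaluating it: accepting
// any Eigen::ArrayBase<Derived> lets callers hand over raw expressions.
template <typename Derived>
float sum_sq_mag( const Eigen::ArrayBase<Derived> & arr )
{
    return ( arr * arr.conjugate() ).real().sum();
}
// usage: float s = sum_sq_mag( mat.array() );  // mat.array() is not evaluated first
```

Note that each intermediate above is used exactly once, in line with the condition stated earlier.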
According to the instructive experiment about `.abs2()` in @ggael's answer, I think another guideline is to avoid reinventing the wheel.
What happens is that, because of the `.real()` step, Eigen won't explicitly vectorize `test2`. It will thus call the standard `complex::operator*`, which, unfortunately, is never inlined by gcc. The other versions, on the other hand, use Eigen's own vectorized product implementation for complexes.

In contrast, ICC does inline `complex::operator*`, making `test2` the fastest for ICC. You can also rewrite `test2` as:
```cpp
return mat.array().abs2().sum();
```
to get even better performance on all compilers:
```
gcc:
test1_us: 66016
test2_us: 26654
test3_us: 34814

icpc:
test1_us: 87225
test2_us: 8274
test3_us: 44598

clang:
test1_us: 87543
test2_us: 26891
test3_us: 44617
```
The extremely good score of ICC in this case is due to its clever auto-vectorization engine.
Another way to work around gcc's inlining failure without modifying `test2` is to define your own `operator*` for `complex<float>`. For instance, add the following at the top of your file:
```cpp
namespace std {
  complex<float> operator*(const complex<float> &a, const complex<float> &b) {
    return complex<float>(real(a)*real(b) - imag(a)*imag(b),
                          imag(a)*real(b) + real(a)*imag(b));
  }
}
```
and then I get:
```
gcc:
test1_us: 69352
test2_us: 28171
test3_us: 36501

icpc:
test1_us: 93810
test2_us: 11350
test3_us: 51007

clang:
test1_us: 83138
test2_us: 26206
test3_us: 45224
```
Of course, this trick is not always recommended as, in contrast to the glibc version, it might lead to overflow or numerical-cancellation issues; but this is what icpc and the other vectorized versions compute anyway.
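For instance, here is a small illustration of the overflow risk in the naive formula (my own example, not from the answer; the library's exact behavior follows C99 Annex G semantics and varies by implementation):

```cpp
#include <complex>
#include <iostream>

int main()
{
    float big = 1e30f;

    // Naive formula: re = a.re*b.re - a.im*b.im. On typical targets where
    // float math is done in single precision, big*big overflows to +inf,
    // so the subtraction yields inf - inf = NaN.
    float naive_real = big * big - big * big;
    std::cout << "naive real part: " << naive_real << "\n";

    // The standard library multiplication may apply extra recovery logic
    // (C99 Annex G semantics) instead of silently producing NaN.
    std::complex<float> a(big, big), b(big, big);
    std::cout << "std::complex:    " << a * b << "\n";
}
```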