int operators != and == when comparing to zero

Tags:

c++

performance

assembly

machine-code

I've found that != and == are not the fastest ways to test for zero or non-zero.

    bool nonZero1 = integer != 0;
        xor eax, eax
        test ecx, ecx
        setne al

    bool nonZero2 = integer < 0 || integer > 0;
        test ecx, ecx
        setne al

    bool zero1 = integer == 0;
        xor eax, eax
        test ecx, ecx
        sete al

    bool zero2 = !(integer < 0 || integer > 0);
        test ecx, ecx
        sete al

Compiler: VC++ 11
Optimization flags: /O2 /GL /LTCG

This is the assembly output for x86-32. The second versions of both comparisons were ~12% faster on both x86-32 and x86-64. However, on x86-64 the instructions were identical (first versions looked exactly like the second versions), but the second versions were still faster.

1. Why doesn't the compiler generate the faster version on x86-32?
2. Why are the second versions still faster on x86-64 when the assembly output is identical?

EDIT: I've added benchmarking code.

ZERO: 1544ms (first version), 1358ms (second version)
NON_ZERO: 1544ms (first version), 1358ms (second version)

http://pastebin.com/m7ZSUrcP or http://anonymouse.org/cgi-bin/anon-www.cgi/http://pastebin.com/m7ZSUrcP

Note: These functions are hard to locate in the assembly listing when everything is compiled from a single source file, because main.asm grows quite large. I kept zero1, zero2, nonZero1, and nonZero2 in a separate source file.
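For readers who cannot reach the pastebin, here is a minimal sketch of what such a separate source file might contain. The function signatures and the checkEquivalence helper are assumptions made for illustration; the actual benchmark is the code at the link above.

```cpp
// comparisons.cpp (hypothetical file name; signatures are assumed,
// not copied from the original benchmark)
#include <cassert>
#include <climits>

bool nonZero1(int integer) { return integer != 0; }
bool nonZero2(int integer) { return integer < 0 || integer > 0; }
bool zero1(int integer) { return integer == 0; }
bool zero2(int integer) { return !(integer < 0 || integer > 0); }

// Sanity check: both spellings must agree, here on a few boundary values.
void checkEquivalence() {
    const int samples[] = { INT_MIN, -1, 0, 1, INT_MAX };
    for (int value : samples) {
        assert(nonZero1(value) == nonZero2(value));
        assert(zero1(value) == zero2(value));
        assert(zero1(value) != nonZero1(value));
    }
}
```

Since the two spellings of each test always return the same value, any timing difference has to come from code generation or from the measurement, not from the logic.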

EDIT2: Could someone with both VC++11 and VC++2010 installed run the benchmarking code and post the timings? It might indeed be a bug in VC++11.

481

asked May 31 '12 17:05

NFRCR

1 Answer

This is a great question, but I think you've fallen victim to the compiler's dependency analysis.

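The compiler only has to clear the high bits of eax once, and they remain clear for the second version. setne and sete write only al, the low byte, so the rest of eax must already be zero for the full register to hold 0 or 1. The second version would otherwise have to pay for its own xor eax, eax, but the compiler's analysis proved that the register was left cleared by the first version.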

The second version is able to "cheat" by taking advantage of work the compiler did in the first version.

How are you measuring times? Is it "(version one, followed by version two) in a loop", or "(version one in a loop) followed by (version two in a loop)"?

Don't do both tests in the same program (instead recompile for each version), or if you do, test both "version A first" and "version B first" and see if whichever comes first is paying a penalty.
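A minimal sketch of the second suggestion, timing both orders within one program. The names versionA, versionB, timeLoop, and OrderResult are hypothetical and not taken from the benchmark in the question; an optimizer may still inline or merge the two versions, so treat the numbers as a sanity check, not a verdict.

```cpp
#include <chrono>

// Hypothetical stand-ins for the two versions under test.
bool versionA(int integer) { return integer != 0; }
bool versionB(int integer) { return integer < 0 || integer > 0; }

// Calls fn `iterations` times and returns the elapsed milliseconds.
// Accumulating into a volatile keeps the loop from being optimized away.
template <typename Fn>
long long timeLoop(Fn fn, int iterations) {
    volatile unsigned long long sink = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        sink = sink + (fn(i) ? 1u : 0u);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}

// Timings for one ordering: whichever version ran first, then the other.
struct OrderResult {
    long long firstMs;
    long long secondMs;
};

OrderResult runAThenB(int iterations) {
    OrderResult result;
    result.firstMs = timeLoop(versionA, iterations);
    result.secondMs = timeLoop(versionB, iterations);
    return result;
}

OrderResult runBThenA(int iterations) {
    OrderResult result;
    result.firstMs = timeLoop(versionB, iterations);
    result.secondMs = timeLoop(versionA, iterations);
    return result;
}
```

If whichever loop runs first is the slower one in both orderings, the difference comes from run order, not from the comparison itself.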

Illustration of the cheating:

    timer1.start();
    double x1 = 2 * sqrt(n + 37 * y + exp(z));
    timer1.stop();

    timer2.start();
    double x2 = 31 * sqrt(n + 37 * y + exp(z));
    timer2.stop();

If the timer2 duration is less than the timer1 duration, we don't conclude that multiplying by 31 is faster than multiplying by 2. Instead, we realize that the compiler performed common subexpression elimination, and the code became:

    timer1.start();
    double common = sqrt(n + 37 * y + exp(z));
    double x1 = 2 * common;
    timer1.stop();

    timer2.start();
    double x2 = 31 * common;
    timer2.stop();

The only thing this proves is that multiplying by 31 is faster than computing common, which is hardly surprising: multiplication is far faster than sqrt and exp.
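One way to stop the compiler from sharing work between the two timed regions, sketched under assumptions: read the inputs through volatile objects, so that each region has to reload them and recompute the whole expression. The names inputN, inputY, inputZ, and scaledRoot are hypothetical.

```cpp
#include <cmath>

// Volatile inputs: every read is an observable access, so the compiler
// cannot assume two reads return the same value and reuse an earlier result.
volatile double inputN = 10.0;
volatile double inputY = 2.0;
volatile double inputZ = 1.0;

// Reloads the inputs on each call, then evaluates
// factor * sqrt(n + 37 * y + exp(z)) from scratch.
double scaledRoot(double factor) {
    const double n = inputN;
    const double y = inputY;
    const double z = inputZ;
    return factor * std::sqrt(n + 37 * y + std::exp(z));
}
```

Timing scaledRoot(2) in one region and scaledRoot(31) in the other then charges each region the full cost of sqrt and exp, so the two durations become comparable.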

131

answered Sep 19 '22 18:09

Ben Voigt
