In an effort to eke out some cmov instructions on an intel core 2 running windows 7 pro I wrote the code below. All it does is take a string from the console as input, apply some shift operations to generate a random seed, and then pass that seed on to srand, for the generation of a small array of pseudorandom numbers. The pseudorandom numbers are then evaluated for whether they satisfy the predicate function ( more arbitrary bitshuffling ), and output a '*' or a '_'. The purpose of the experiment is to generate cmov instructions, but as you can see in the disassembly below, there are none.
Any tips on how to change the code or the cflags so that they'll be generated?
#include <iostream>
#include <algorithm>
#include <string>
#include <cstdlib>
bool blackBoxPredicate( const unsigned int& ubref ) {
return ((ubref << 6) ^ (ubref >> 2) ^ (~ubref << 2)) % 15 == 0;
}
int main() {
const unsigned int NUM_RINTS = 32;
unsigned int randomSeed = 1;
unsigned int popCount = 0;
unsigned int * rintArray = new unsigned int[NUM_RINTS];
std::string userString;
std::cout << "input a string to use as a random seed: ";
std::cin >> userString;
std::for_each(
userString.begin(),
userString.end(),
[&randomSeed] (char c) {
randomSeed = (randomSeed * c) ^ (randomSeed << (c % 7));
});
std::cout << "seed computed: " << randomSeed << std::endl;
srand(randomSeed);
for( int i = 0; i < NUM_RINTS; ++i ) {
rintArray[i] = static_cast<unsigned int> (rand());
bool pr = blackBoxPredicate(rintArray[i]);
popCount = (pr) ? (popCount+1) : (popCount);
std::cout << ((pr) ? ('*') : ('_')) << " ";
}
std::cout << std::endl;
delete rintArray;
return 0;
}
And used this makefile to build it:
OUT=cmov_test.exe
ASM_OUT=cmov_test.asm
OBJ_OUT=cmov_test.obj
SRC=cmov_test.cpp
THIS=makefile
CXXFLAGS=/nologo /EHsc /arch:SSE2 /Ox /W3
$(OUT): $(SRC) $(THIS)
cl $(SRC) $(CXXFLAGS) /FAscu /Fo$(OBJ_OUT) /Fa$(ASM_OUT) /Fe$(OUT)
clean:
erase $(OUT) $(ASM_OUT) $(OBJ_OUT)
And yet when I went to see whether any had been generated, I saw that microsoft's compilers had generated the following assembly for that last for loop:
; 34 : popCount = (pr) ? (popCount+1) : (popCount);
; 35 :
; 36 : std::cout << ((pr) ? ('*') : ('_')) << " ";
00145 68 00 00 00 00 push OFFSET $SG30347
0014a 85 d2 test edx, edx
0014c 0f 94 c0 sete al
0014f f6 d8 neg al
00151 1a c0 sbb al, al
00153 24 cb and al, -53 ; ffffffcbH
00155 04 5f add al, 95 ; 0000005fH
00157 0f b6 d0 movzx edx, al
0015a 52 push edx
0015b 68 00 00 00 00 push OFFSET ?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A ; std::cout
00160 e8 00 00 00 00 call ??$?6U?$char_traits@D@std@@@std@@YAAAV?$basic_ostream@DU?$char_traits@D@std@@@0@AAV10@D@Z ; std::operator<<<std::char_traits<char> >
00165 83 c4 08 add esp, 8
00168 50 push eax
00169 e8 00 00 00 00 call ??$?6U?$char_traits@D@std@@@std@@YAAAV?$basic_ostream@DU?$char_traits@D@std@@@0@AAV10@PBD@Z ; std::operator<<<std::char_traits<char> >
0016e 46 inc esi
0016f 83 c4 08 add esp, 8
00172 83 fe 20 cmp esi, 32 ; 00000020H
00175 72 a9 jb SHORT $LL3@main
For your reference, here are my cpu id strings and compiler version.
PROCESSOR_ARCHITECTURE=x86
PROCESSOR_IDENTIFIER=x86 Family 6 Model 58 Stepping 9, GenuineIntel
PROCESSOR_LEVEL=6
PROCESSOR_REVISION=3a09
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
It is extremely difficult, if not downright impossible, to get Microsoft's 32-bit C/C++ compiler to emit CMOVcc
instructions.
What you have to remember is that conditional moves were first introduced with the Pentium Pro processor, and while Microsoft had a compiler switch that would tune the generated code for this 6th generation processor (the long-deprecated /G6
), they never emitted code that would run exclusively on this processor. The code still needed to run on 5th generation processors (i.e., Pentium and AMD K6), so it couldn't use CMOVcc
instructions because those would have generated illegal instruction exceptions. Unlike Intel's compiler, global dynamic dispatching was not (and is still not) implemented.
Also, it is worth noting that no switch was ever introduced to target exclusively 6th generation processors and later. There's no /arch:CMOV
or whatever they might call it. Supported values for the /arch
switch go straight from IA32
(the lowest common denominator, for which CMOV
would be potentially illegal) to SSE
. However, the documentation does confirm that, as one might expect, enabling SSE or SSE2 code generation implicitly enables the use of conditional-move instructions and anything else that was introduced before SSE:
In addition to using the SSE and SSE2 instructions, the compiler also uses other instructions that are present on the processor revisions that support SSE and SSE2. An example is the CMOV instruction that first appeared on the Pentium Pro revision of the Intel processors.
Therefore, in order to have any hope of getting the compiler to emit CMOV
instructions, you must set /arch:SSE
or higher. Nowadays, of course, this is no big deal. You can simply set /arch:SSE
or /arch:SSE2
and be safe, since all modern processors support these instruction sets.
But that's only half of the battle. Even when you have the right compiler switches enabled, it is extremely difficult to get MSVC to emit CMOV
instructions. Here are two important observations:
MSVC 10 (Visual Studio 2010) and earlier virtually never generate CMOV
instructions. I've never seen them in the output, no matter how many variations of source code I've tried. I say "virtually" because there might be some crazy edge case that I missed, but I very much doubt it. None of the optimization flags have any effect on this.
However, MSVC 11 (Visual Studio 2012) introduced significant improvements to the code generator, at least in this aspect. This and later versions of the compiler now seem to be at least aware of the existence of the CMOVcc
instructions, and may emit them under the right conditions (i.e., /arch:SSE
or later, and use of the conditional operator, as described below).
I've found that the most effective way to cajole the compiler into emitting a CMOV
instruction is to use the conditional operator instead of a long-form if
-else
statement. Although these two constructs should be completely equivalent as far as the code generator is concerned, they are not.
In other words, while you might see the following translated to a branchless CMOVLE
instruction:
int value = (a < b) ? a : b;
you will always get branching code for the following sequence:
int value;
if (a < b) value = a;
else value = b;
At the very least, even if your use of the conditional operator doesn't cause a CMOV
instruction (such as on MSVC 10 or earlier), you might still be lucky enough to get branchless code by some other means—e.g., SETcc
or clever use of SBB
and NEG
/NOT
/INC
/DEC
. This is what the disassembly you've shown in the question uses, and although it is not quite as optimal as CMOVcc
, it's certainly comparable and the difference is not worth worrying about. (The only other branching instruction is part of the loop.)
If you truly want branchless code (which you often do when hand-optimizing), and you're not having any luck getting the compiler to generate the code you want, you'll need to get more clever in how you write the source code. I've had good luck with writing code that computes the result branchlessly using bitwise or arithmetic operators.
For example, you might wish that the following function generated optimal code:
int Minimum(int a, int b)
{
return (a < b) ? a : b;
}
You followed rule #2 and used a conditional operator, but if you're using an older version of the compiler, you'll get branching code anyway. Outsmart the compiler using the classic trick:
int Minimum_Optimized(int a, int b)
{
return (b + ((a - b) & -(a < b)));
}
The resulting object code is not perfectly optimal (it contains a CMP
instruction that is redundant since SUB
already sets the flags), but it is branchless and will therefore still be significantly faster than the original attempt on random inputs that cause branch prediction to fail.
As another example, imagine that you want to determine whether a 64-bit integer is negative in a 32-bit application. You write the following self-evident code:
bool IsNegative(int64_t value)
{
return (value < 0);
}
and will find yourself sorely disappointed in the results. GCC and Clang optimize this sensibly, but MSVC spits out a nasty conditional branch. The (non-portable) trick is realizing that the sign bit is in the upper 32 bits, so you can isolate and test that explicitly using bitwise manipulation:
bool IsNegative_Optimized(int64_t value)
{
return (static_cast<int32_t>((value & 0xFFFFFFFF00000000ULL) >> 32) < 0);
}
In addition, one of the commentators suggests using inline assembly. While this is possible (Microsoft's 32-bit compiler supports inline assembly), it is often a poor choice. Inline assembly disrupts the optimizer in rather significant ways, so unless you're writing significant swaths of code in inline assembly, there is unlikely to be a substantial net performance gain. Furthermore, Microsoft's inline assembly syntax is extremely limited. It trades flexibility for simplicity in a big way. In particular, there is no way to specify input values, so you're stuck loading the input from memory into a register, and the caller is forced to spill the input from a register to memory in preparation. This creates a phenomenon I like to call "a whole lotta shufflin' goin' on", or for short, "slow code". You don't drop to inline assembly in cases where slow code is acceptable. Thus, it is always preferable (at least on MSVC) to figure out how to write C/C++ source code that persuades the compiler to emit the object code you want. Even if you can only get close to the ideal output, that's still considerably better than the penalty you pay for using inline assembly.
Note that none of these contortions are necessary if you target x86-64. Microsoft's 64-bit C/C++ compiler is significantly more aggressive about using CMOVcc
instructions whenever possible, even the older versions. As this blog post explains, the x64 compiler bundled with Visual Studio 2010 contains a number of code-quality improvements, including better identification and use of CMOV
instructions.
No special compiler flags or other considerations are necessary here, since all processors that support 64-bit mode support conditional moves. I suppose this is why they were able to get it right for the 64-bit compiler. I also suspect that some of these changes made to the x86-64 compiler in VS 2010 were ported to the x86-32 compiler in VS 2012, explaining why it is at least aware of the existence of CMOV
, but it still doesn't use it as aggressively as the 64-bit compiler.
The bottom line is, when targeting x86-64, write the code in the way that makes the most sense. The optimizer actually knows how to do its job!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With