Generating CMOV instructions using Microsoft compilers

Question

In an effort to eke out some cmov instructions on an intel core 2 running windows 7 pro I wrote the code below. All it does is take a string from the console as input, apply some shift operations to generate a random seed, and then pass that seed on to srand, for the generation of a small array of pseudorandom numbers. The pseudorandom numbers are then evaluated for whether they satisfy the predicate function ( more arbitrary bitshuffling ), and output a '*' or a '_'. The purpose of the experiment is to generate cmov instructions, but as you can see in the disassembly below, there are none.

Any tips on how to change the code or the cflags so that they'll be generated?

#include <iostream>
#include <algorithm>
#include <string>
#include <cstdlib>

bool blackBoxPredicate( const unsigned int& ubref ) {
   return ((ubref << 6) ^ (ubref >> 2) ^ (~ubref << 2)) % 15 == 0;
}

int main() {
   const unsigned int NUM_RINTS = 32;
   unsigned int randomSeed = 1;
   unsigned int popCount = 0;
   unsigned int * rintArray = new unsigned int[NUM_RINTS];
   std::string userString;

   std::cout << "input a string to use as a random seed: ";
   std::cin >> userString;

   std::for_each( 
      userString.begin(), 
      userString.end(), 
      [&randomSeed] (char c) {
         randomSeed = (randomSeed * c) ^ (randomSeed << (c % 7));
   });

   std::cout << "seed computed: " << randomSeed << std::endl;

   srand(randomSeed);

   for( int i = 0; i < NUM_RINTS; ++i ) {
      rintArray[i] = static_cast<unsigned int> (rand());
      bool pr = blackBoxPredicate(rintArray[i]);
      popCount = (pr) ? (popCount+1) : (popCount);

      std::cout << ((pr) ? ('*') : ('_')) << " ";
   }

   std::cout << std::endl;

   delete rintArray;
   return 0;
}

And used this makefile to build it:

OUT=cmov_test.exe
ASM_OUT=cmov_test.asm
OBJ_OUT=cmov_test.obj
SRC=cmov_test.cpp
THIS=makefile

CXXFLAGS=/nologo /EHsc /arch:SSE2 /Ox /W3

$(OUT): $(SRC) $(THIS)
   cl $(SRC) $(CXXFLAGS) /FAscu /Fo$(OBJ_OUT) /Fa$(ASM_OUT) /Fe$(OUT)

clean:
   erase $(OUT) $(ASM_OUT) $(OBJ_OUT)

And yet when I went to see whether any had been generated, I saw that microsoft's compilers had generated the following assembly for that last for loop:

; 34   :       popCount = (pr) ? (popCount+1) : (popCount);
; 35   :       
; 36   :       std::cout << ((pr) ? ('*') : ('_')) << " ";

  00145 68 00 00 00 00   push    OFFSET $SG30347
  0014a 85 d2        test    edx, edx
  0014c 0f 94 c0     sete    al
  0014f f6 d8        neg     al
  00151 1a c0        sbb     al, al
  00153 24 cb        and     al, -53            ; ffffffcbH
  00155 04 5f        add     al, 95         ; 0000005fH
  00157 0f b6 d0     movzx   edx, al
  0015a 52       push    edx
  0015b 68 00 00 00 00   push    OFFSET ?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A ; std::cout
  00160 e8 00 00 00 00   call    ??$?6U?$char_traits@D@std@@@std@@YAAAV?$basic_ostream@DU?$char_traits@D@std@@@0@AAV10@D@Z ; std::operator<<<std::char_traits<char> >
  00165 83 c4 08     add     esp, 8
  00168 50       push    eax
  00169 e8 00 00 00 00   call    ??$?6U?$char_traits@D@std@@@std@@YAAAV?$basic_ostream@DU?$char_traits@D@std@@@0@AAV10@PBD@Z ; std::operator<<<std::char_traits<char> >
  0016e 46       inc     esi
  0016f 83 c4 08     add     esp, 8
  00172 83 fe 20     cmp     esi, 32            ; 00000020H
  00175 72 a9        jb  SHORT $LL3@main

For your reference, here are my cpu id strings and compiler version.

PROCESSOR_ARCHITECTURE=x86
PROCESSOR_IDENTIFIER=x86 Family 6 Model 58 Stepping 9, GenuineIntel
PROCESSOR_LEVEL=6
PROCESSOR_REVISION=3a09

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86

Cody Gray · Accepted Answer

It is extremely difficult, if not downright impossible, to get Microsoft's 32-bit C/C++ compiler to emit CMOVcc instructions.

What you have to remember is that conditional moves were first introduced with the Pentium Pro processor, and while Microsoft had a compiler switch that would tune the generated code for this 6th generation processor (the long-deprecated /G6), they never emitted code that would run exclusively on this processor. The code still needed to run on 5th generation processors (i.e., Pentium and AMD K6), so it couldn't use CMOVcc instructions because those would have generated illegal instruction exceptions. Unlike Intel's compiler, global dynamic dispatching was not (and is still not) implemented.

Also, it is worth noting that no switch was ever introduced to target exclusively 6th generation processors and later. There's no /arch:CMOV or whatever they might call it. Supported values for the /arch switch go straight from IA32 (the lowest common denominator, for which CMOV would be potentially illegal) to SSE. However, the documentation does confirm that, as one might expect, enabling SSE or SSE2 code generation implicitly enables the use of conditional-move instructions and anything else that was introduced before SSE:

In addition to using the SSE and SSE2 instructions, the compiler also uses other instructions that are present on the processor revisions that support SSE and SSE2. An example is the CMOV instruction that first appeared on the Pentium Pro revision of the Intel processors.

Therefore, in order to have any hope of getting the compiler to emit CMOV instructions, you must set /arch:SSE or higher. Nowadays, of course, this is no big deal. You can simply set /arch:SSE or /arch:SSE2 and be safe, since all modern processors support these instruction sets.

But that's only half of the battle. Even when you have the right compiler switches enabled, it is extremely difficult to get MSVC to emit CMOV instructions. Here are two important observations:

MSVC 10 (Visual Studio 2010) and earlier virtually never generate CMOV instructions. I've never seen them in the output, no matter how many variations of source code I've tried. I say "virtually" because there might be some crazy edge case that I missed, but I very much doubt it. None of the optimization flags have any effect on this.

However, MSVC 11 (Visual Studio 2012) introduced significant improvements to the code generator, at least in this aspect. This and later versions of the compiler now seem to be at least aware of the existence of the CMOVcc instructions, and may emit them under the right conditions (i.e., /arch:SSE or later, and use of the conditional operator, as described below).
I've found that the most effective way to cajole the compiler into emitting a CMOV instruction is to use the conditional operator instead of a long-form if-else statement. Although these two constructs should be completely equivalent as far as the code generator is concerned, they are not.

In other words, while you might see the following translated to a branchless CMOVLE instruction:
```
int value = (a < b) ? a : b;
```
you will always get branching code for the following sequence:
```
int value;
if (a < b)    value = a;
else          value = b;
```
At the very least, even if your use of the conditional operator doesn't cause a CMOV instruction (such as on MSVC 10 or earlier), you might still be lucky enough to get branchless code by some other means—e.g., SETcc or clever use of SBB and NEG/NOT/INC/DEC. This is what the disassembly you've shown in the question uses, and although it is not quite as optimal as CMOVcc, it's certainly comparable and the difference is not worth worrying about. (The only other branching instruction is part of the loop.)

If you truly want branchless code (which you often do when hand-optimizing), and you're not having any luck getting the compiler to generate the code you want, you'll need to get more clever in how you write the source code. I've had good luck with writing code that computes the result branchlessly using bitwise or arithmetic operators.

For example, you might wish that the following function generated optimal code:

int Minimum(int a, int b)
{
    return (a < b) ? a : b;
}

You followed rule #2 and used a conditional operator, but if you're using an older version of the compiler, you'll get branching code anyway. Outsmart the compiler using the classic trick:

int Minimum_Optimized(int a, int b)
{
    return (b + ((a - b) & -(a < b)));
}

The resulting object code is not perfectly optimal (it contains a CMP instruction that is redundant since SUB already sets the flags), but it is branchless and will therefore still be significantly faster than the original attempt on random inputs that cause branch prediction to fail.

As another example, imagine that you want to determine whether a 64-bit integer is negative in a 32-bit application. You write the following self-evident code:

bool IsNegative(int64_t value)
{
    return (value < 0);
}

and will find yourself sorely disappointed in the results. GCC and Clang optimize this sensibly, but MSVC spits out a nasty conditional branch. The (non-portable) trick is realizing that the sign bit is in the upper 32 bits, so you can isolate and test that explicitly using bitwise manipulation:

bool IsNegative_Optimized(int64_t value)
{
    return (static_cast<int32_t>((value & 0xFFFFFFFF00000000ULL) >> 32) < 0);
}

In addition, one of the commentators suggests using inline assembly. While this is possible (Microsoft's 32-bit compiler supports inline assembly), it is often a poor choice. Inline assembly disrupts the optimizer in rather significant ways, so unless you're writing significant swaths of code in inline assembly, there is unlikely to be a substantial net performance gain. Furthermore, Microsoft's inline assembly syntax is extremely limited. It trades flexibility for simplicity in a big way. In particular, there is no way to specify input values, so you're stuck loading the input from memory into a register, and the caller is forced to spill the input from a register to memory in preparation. This creates a phenomenon I like to call "a whole lotta shufflin' goin' on", or for short, "slow code". You don't drop to inline assembly in cases where slow code is acceptable. Thus, it is always preferable (at least on MSVC) to figure out how to write C/C++ source code that persuades the compiler to emit the object code you want. Even if you can only get close to the ideal output, that's still considerably better than the penalty you pay for using inline assembly.

Note that none of these contortions are necessary if you target x86-64. Microsoft's 64-bit C/C++ compiler is significantly more aggressive about using CMOVcc instructions whenever possible, even the older versions. As this blog post explains, the x64 compiler bundled with Visual Studio 2010 contains a number of code-quality improvements, including better identification and use of CMOV instructions.

No special compiler flags or other considerations are necessary here, since all processors that support 64-bit mode support conditional moves. I suppose this is why they were able to get it right for the 64-bit compiler. I also suspect that some of these changes made to the x86-64 compiler in VS 2010 were ported to the x86-32 compiler in VS 2012, explaining why it is at least aware of the existence of CMOV, but it still doesn't use it as aggressively as the 64-bit compiler.

The bottom line is, when targeting x86-64, write the code in the way that makes the most sense. The optimizer actually knows how to do its job!

Generating CMOV instructions using Microsoft compilers

Tags:

c++

x86

assembly

visual-c++

disassembly

Max DeLiso

1 Answers

Cody Gray

Recent Activity

Donate For Us

Generating CMOV instructions using Microsoft compilers

Tags:

c++

x86

assembly

visual-c++

disassembly

Max DeLiso

1 Answers

Cody Gray

Related questions

Recent Activity

Donate For Us