Are measurable performance gains possible from using VC++'s __assume?

Are measurable performance gains possible from using VC++'s __assume? If so, please post proof with code and benchmarks in your answer.

The sparse MSDN article on __assume: http://msdn.microsoft.com/en-us/library/1b3fsfxw(v=vs.100).aspx

The article mentions using __assume(0) to make switch statements faster by applying __assume(0) to the default case. I measured no performance increase from using __assume(0) in that way:

#include <windows.h>   // QueryPerformanceCounter, QueryPerformanceFrequency, LARGE_INTEGER
#include <iostream>
#include <vector>

using namespace std;

void NoAssumeSwitchStatement(int i)
{
    switch (i)
    {
    case 0:
        vector<int>();
        break;
    case 1:
        vector<int>();
        break;
    default:
        break;
    }
}

void AssumeSwitchStatement(int i)
{
    switch (i)
    {
    case 0:
        vector<int>();
        break;
    case 1:
        vector<int>();
        break;
    default:
        __assume(0);
    }
}

int main(int argc, char* argv[])
{
    const int Iterations = 1000000;
    LARGE_INTEGER start, middle, end;
    QueryPerformanceCounter(&start);
    for (int i = 0; i < Iterations; ++i)
    {
        NoAssumeSwitchStatement(i % 2);         
    }
    QueryPerformanceCounter(&middle);
    for (int i = 0; i < Iterations; ++i)
    {
        AssumeSwitchStatement(i % 2);
    }
    QueryPerformanceCounter(&end);
    LARGE_INTEGER cpuFrequency;
    QueryPerformanceFrequency(&cpuFrequency);
    cout << "NoAssumeSwitchStatement: " << (((double)(middle.QuadPart - start.QuadPart)) * 1000) / (double)cpuFrequency.QuadPart << "ms" << endl;
    cout << "  AssumeSwitchStatement: " << (((double)(end.QuadPart - middle.QuadPart)) * 1000) / (double)cpuFrequency.QuadPart << "ms" << endl;
    return 0;
}

Rounded console output, 1000000 iterations:

NoAssumeSwitchStatement: 46ms
AssumeSwitchStatement: 46ms

asked Feb 29 '12 by Neil Justice

2 Answers

Benchmarks lie. They rarely measure what you want them to. In this particular case, the functions were probably inlined, so the __assume was simply redundant.
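If you want to verify that, one option is to forbid inlining of the two benchmarked functions so the switch itself is what gets timed. A minimal sketch using MSVC's __declspec(noinline) (the attribute is real; applying it to your functions is my suggestion, not part of the original benchmark):

#include <vector>

// Prevent inlining so each call really goes through the switch.
__declspec(noinline) void NoAssumeSwitchStatement(int i)
{
    switch (i)
    {
    case 0: std::vector<int>(); break;
    case 1: std::vector<int>(); break;
    default: break;
    }
}

__declspec(noinline) void AssumeSwitchStatement(int i)
{
    switch (i)
    {
    case 0: std::vector<int>(); break;
    case 1: std::vector<int>(); break;
    default: __assume(0);   // promise the optimizer this branch is never taken
    }
}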

As to the actual question: yes, it may help. A switch is generally implemented as a jump table; by reducing the size of this table, or by removing some entries, the compiler may be able to select better CPU instructions to implement the switch.

In your extreme case, it can turn the switch into an if (i == 0) { } else { } structure, which is usually efficient.

Furthermore, trimming dead branches helps keep the code tidy, and less code means better use of the CPU instruction cache.

However, these are micro-optimizations, and they rarely pay off: you need a profiler to point them out to you, and even then it may be difficult to work out the particular transformation to make (is __assume the best choice?). This is expert work.

EDIT: In action with LLVM

void foo(void);
void bar(void);

void regular(int i) {
  switch(i) {
  case 0: foo(); break;
  case 1: bar(); break;
  }
}

void optimized(int i)  {
  switch(i) {
  case 0: foo(); break;
  case 1: bar(); break;
  default: __builtin_unreachable();
  }
}

Note that the only difference is the presence or absence of __builtin_unreachable(), which is similar to MSVC's __assume(0).
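As an aside, the same hint can be expressed portably by wrapping the compiler-specific intrinsic in a small macro. This is just a sketch; the name UNREACHABLE is mine, not a standard one:

void foo(void);
void bar(void);

/* Map the hint to whatever the compiler provides. */
#if defined(_MSC_VER)
#  define UNREACHABLE() __assume(0)
#elif defined(__GNUC__) || defined(__clang__)
#  define UNREACHABLE() __builtin_unreachable()
#else
#  define UNREACHABLE() ((void)0)   /* no-op fallback */
#endif

void optimized_portable(int i) {
  switch(i) {
  case 0: foo(); break;
  case 1: bar(); break;
  default: UNREACHABLE();   /* same promise, spelled portably */
  }
}

The LLVM IR generated for regular and optimized follows: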

define void @regular(i32 %i) nounwind uwtable {
  switch i32 %i, label %3 [
    i32 0, label %1
    i32 1, label %2
  ]

; <label>:1                                       ; preds = %0
  tail call void @foo() nounwind
  br label %3

; <label>:2                                       ; preds = %0
  tail call void @bar() nounwind
  br label %3

; <label>:3                                       ; preds = %2, %1, %0
  ret void
}

define void @optimized(i32 %i) nounwind uwtable {
  %cond = icmp eq i32 %i, 1
  br i1 %cond, label %2, label %1

; <label>:1                                       ; preds = %0
  tail call void @foo() nounwind
  br label %3

; <label>:2                                       ; preds = %0
  tail call void @bar() nounwind
  br label %3

; <label>:3                                       ; preds = %2, %1
  ret void
}

And note here how the switch statement in regular can be optimized into a simple comparison in optimized.

This maps to the following x86 assembly:

    .globl  regular                  |      .globl  optimized
    .align  16, 0x90                 |      .align  16, 0x90
    .type   regular,@function        |      .type   optimized,@function
regular:                             |    optimized:
.Ltmp0:                              |    .Ltmp3:
    .cfi_startproc                   |            .cfi_startproc
# BB#0:                              |    # BB#0:
    cmpl    $1, %edi                 |            cmpl    $1, %edi
    je      .LBB0_3                  |            je      .LBB1_2
# BB#1:                              |
    testl    %edi, %edi              |
    jne     .LBB0_4                  |
# BB#2:                              |    # BB#1:
    jmp     foo                      |            jmp     foo
.LBB0_3:                             |    .LBB1_2:
    jmp     bar                      |            jmp     bar
.LBB0_4:                             |
    ret                              |
.Ltmp1:                              |    .Ltmp4:
    .size   regular, .Ltmp1-regular  |      .size   optimized, .Ltmp4-optimized
.Ltmp2:                              |    .Ltmp5:
    .cfi_endproc                     |      .cfi_endproc
.Leh_func_end0:                      |    .Leh_func_end1:

Note how, in the second case:

  • the code is tighter (fewer instructions)
  • there is a single compare/jump pair (cmpl/je) on every path (rather than one path with a single jump and another with two)

Also note that the two versions are so close that I have no idea how to measure anything other than noise...

On the other hand, it does semantically indicate intent, though perhaps an assert would be better suited if the semantics are all you care about.
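One common pattern along those lines (a sketch of my own, not something from the question) is a macro that checks the condition with assert in debug builds and hands it to MSVC's __assume in release builds; MY_ASSUME is a name picked purely for illustration:

#include <cassert>

#ifdef NDEBUG
#  define MY_ASSUME(cond) __assume(cond)   /* release: optimizer hint, never checked */
#else
#  define MY_ASSUME(cond) assert(cond)     /* debug: actually verified at run time */
#endif

int divide(int num, int den) {
    MY_ASSUME(den != 0);   // the caller guarantees a non-zero denominator
    return num / den;
}

That way the assumption is documented, tested in debug builds, and still exploited by the optimizer when it matters.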

answered by Matthieu M.


Seems it does make a little difference if you set the right compiler switches...

Three runs follow: no optimizations, optimized for speed, and optimized for size.

This run has no optimizations

C:\temp\code>cl /EHsc /FAscu assume.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86

assume.cpp
Microsoft (R) Incremental Linker Version 10.00.40219.01

/out:assume.exe
assume.obj

C:\temp\code>assume
NoAssumeSwitchStatement: 29.5321ms
  AssumeSwitchStatement: 31.0288ms

This is with maximum optimizations (/Ox). Note that /O2 was basically identical speed-wise.

C:\temp\code>cl /Ox /EHsc /Fa assume.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86

assume.cpp
Microsoft (R) Incremental Linker Version 10.00.40219.01
/out:assume.exe
assume.obj

C:\temp\code>assume
NoAssumeSwitchStatement: 1.33492ms
  AssumeSwitchStatement: 0.666948ms

This run was to minimize code space

C:\temp\code>cl -O1 /EHsc /FAscu assume.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
assume.cpp
Microsoft (R) Incremental Linker Version 10.00.40219.01
/out:assume.exe
assume.obj

C:\temp\code>assume
NoAssumeSwitchStatement: 5.67691ms
  AssumeSwitchStatement: 5.36186ms

Note that the output assembly code agrees with what Matthieu M. had to say when the speed optimizations are used. In the other cases, the switch functions were actually called (not inlined).

answered by JimR