Are measurable performance gains possible from using VC++'s __assume
? If so, please post proof with code and benchmarks in your answer.
The sparse MSDN article on __assume: http://msdn.microsoft.com/en-us/library/1b3fsfxw(v=vs.100).aspx
Mentioned in the article is the use of __assume(0)
to make switch
statements faster by __assume(0)
ing the default
case. I measured no performance increase from using __assume(0)
in that way:
void NoAssumeSwitchStatement(int i)
{
switch (i)
{
case 0:
vector<int>();
break;
case 1:
vector<int>();
break;
default:
break;
}
}
void AssumeSwitchStatement(int i)
{
switch (i)
{
case 0:
vector<int>();
break;
case 1:
vector<int>();
break;
default:
__assume(0);
}
}
int main(int argc, char* argv[])
{
const int Iterations = 1000000;
LARGE_INTEGER start, middle, end;
QueryPerformanceCounter(&start);
for (int i = 0; i < Iterations; ++i)
{
NoAssumeSwitchStatement(i % 2);
}
QueryPerformanceCounter(&middle);
for (int i = 0; i < Iterations; ++i)
{
AssumeSwitchStatement(i % 2);
}
QueryPerformanceCounter(&end);
LARGE_INTEGER cpuFrequency;
QueryPerformanceFrequency(&cpuFrequency);
cout << "NoAssumeSwitchStatement: " << (((double)(middle.QuadPart - start.QuadPart)) * 1000) / (double)cpuFrequency.QuadPart << "ms" << endl;
cout << " AssumeSwitchStatement: " << (((double)(end.QuadPart - middle.QuadPart)) * 1000) / (double)cpuFrequency.QuadPart << "ms" << endl;
return 0;
}
Rounded console output, 1000000 iterations:
NoAssumeSwitchStatement: 46ms
AssumeSwitchStatement: 46ms
Benchmark lies. They rarely measure what you want them to. In this particular case, the methods were probably inlined, and so the __assume
was just redundant.
As to the actual question, yes it may help. A switch is generally implemented by a jump table, by reducing the size of this table, or removing some entries, the compiler might be able to select better CPU instructions to implement the switch
.
In your extreme case, it can turn the switch
into a if (i == 0) { } else { }
structure, which is usually efficient.
Furthermore, trimming dead branches help keeping code tidy, and less code means a better usage of the CPU instruction cache.
However, those are micro-optimizations, and they rarely pay off: you need a profiler to point them out you, and even them it might be difficult to understand the particular transformation to make (is __assume
the best ?). This is expert's work.
EDIT: In action with LLVM
void foo(void);
void bar(void);
void regular(int i) {
switch(i) {
case 0: foo(); break;
case 1: bar(); break;
}
}
void optimized(int i) {
switch(i) {
case 0: foo(); break;
case 1: bar(); break;
default: __builtin_unreachable();
}
}
Note that the only difference is the presence, or absence of the __builtin_unreachable()
which is similar to MSVC __assume(0)
.
define void @regular(i32 %i) nounwind uwtable {
switch i32 %i, label %3 [
i32 0, label %1
i32 1, label %2
]
; <label>:1 ; preds = %0
tail call void @foo() nounwind
br label %3
; <label>:2 ; preds = %0
tail call void @bar() nounwind
br label %3
; <label>:3 ; preds = %2, %1, %0
ret void
}
define void @optimized(i32 %i) nounwind uwtable {
%cond = icmp eq i32 %i, 1
br i1 %cond, label %2, label %1
; <label>:1 ; preds = %0
tail call void @foo() nounwind
br label %3
; <label>:2 ; preds = %0
tail call void @bar() nounwind
br label %3
; <label>:3 ; preds = %2, %1
ret void
}
And note here how the switch
statement in regular
can be optimized into a simple comparison in optimized
.
This maps to the following x86 assembly:
.globl regular | .globl optimized
.align 16, 0x90 | .align 16, 0x90
.type regular,@function | .type optimized,@function
regular: | optimized:
.Ltmp0: | .Ltmp3:
.cfi_startproc | .cfi_startproc
# BB#0: | # BB#0:
cmpl $1, %edi | cmpl $1, %edi
je .LBB0_3 | je .LBB1_2
# BB#1: |
testl %edi, %edi |
jne .LBB0_4 |
# BB#2: | # BB#1:
jmp foo | jmp foo
.LBB0_3: | .LBB1_2:
jmp bar | jmp bar
.LBB0_4: |
ret |
.Ltmp1: | .Ltmp4:
.size regular, .Ltmp1-regular | .size optimized, .Ltmp4-optimized
.Ltmp2: | .Ltmp5:
.cfi_endproc | .cfi_endproc
.Leh_func_end0: | .Leh_func_end1:
Note how, in the second case:
Also note how this is so close that I have no idea how to measure anything else than noise...
On the other hand, semantically it does indicate an intent, though perhaps an assert
could be better suited for the semantics only.
Seems it does make a little difference if you set the right compiler switches...
Three runs follow. No optimizations, opt for speed and opt for size.
This run has no optimizations
C:\temp\code>cl /EHsc /FAscu assume.cpp Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86 assume.cpp Microsoft (R) Incremental Linker Version 10.00.40219.01 /out:assume.exe assume.obj C:\temp\code>assume NoAssumeSwitchStatement: 29.5321ms AssumeSwitchStatement: 31.0288ms
This is with max optimizations (/Ox) Note that /O2 was basically identical speedwise.
C:\temp\code>cl /Ox /EHsc /Fa assume.cpp Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86 assume.cpp Microsoft (R) Incremental Linker Version 10.00.40219.01 /out:assume.exe assume.obj C:\temp\code>assume NoAssumeSwitchStatement: 1.33492ms AssumeSwitchStatement: 0.666948ms
This run was to minimize code space
C:\temp\code>cl -O1 /EHsc /FAscu assume.cpp Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86 assume.cpp Microsoft (R) Incremental Linker Version 10.00.40219.01 /out:assume.exe assume.obj C:\temp\code>assume NoAssumeSwitchStatement: 5.67691ms AssumeSwitchStatement: 5.36186ms
Note that the output assembly code agrees with what Matthiu M. had to say when speed opts are used. The switch functions were called in other cases.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With