When people try to perform rigorous benchmarks of various libraries, I sometimes see code like this:
auto std_start = std::chrono::steady_clock::now();
for (int i = 0; i < 10000; ++i)
    for (int j = 0; j < 10000; ++j)
        volatile const auto __attribute__((unused)) c = std_set.count(i + j);
auto std_stop = std::chrono::steady_clock::now();
The volatile is used here to prevent the optimizer from noticing that the result of the code under test is discarded, and then discarding the entire computation.
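A self-contained version of that pattern might look like the sketch below. The element type of std_set (a std::set<int>) and the shrunken trip counts are my assumptions; the question doesn't show them:

```cpp
#include <chrono>
#include <set>

// Sketch of the volatile-sink timing pattern. std_set as a std::set<int>
// is an assumption; trip counts are reduced so this runs quickly.
double time_count_ms() {
    std::set<int> std_set = {1, 2, 3, 5, 8, 13, 21, 34};
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000; ++i)
        for (int j = 0; j < 1000; ++j) {
            // The volatile store is the sink: each count() result must be
            // materialized, so the call can't be optimized away.
            volatile auto c = std_set.count(i + j);
            (void)c; // silence "unused variable" without __attribute__
        }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

The `(void)c` cast is just a portable way to suppress the unused-variable warning, in place of `__attribute__((unused))`.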
When the code under test doesn't return a value, say it is void do_something(int), then sometimes I see code like this:
auto std_start = std::chrono::steady_clock::now();
for (int i = 0; i < 10000; ++i)
    for (int j = 0; j < 10000; ++j)
        static_cast<volatile void>(do_something(i + j));
auto std_stop = std::chrono::steady_clock::now();
Is this correct usage of volatile? What is volatile void? What does it mean from the point of view of the compiler and the standard?
In the standard (N4296) at [dcl.type.cv] it says:
7 [ Note: volatile is a hint to the implementation to avoid aggressive optimization involving the object because the value of the object might be changed by means undetectable by an implementation. Furthermore, for some implementations, volatile might indicate that special hardware instructions are required to access the object. See 1.9 for detailed semantics. In general, the semantics of volatile are intended to be the same in C++ as they are in C. — end note ]
Section 1.9 gives a lot of guidance about the execution model, but as far as volatile is concerned, it's all about "accessing a volatile object". It's not clear to me what executing a statement that has been cast to volatile void means, assuming I understand the code correctly, and exactly what optimization barrier, if any, is produced.
static_cast<volatile void>(foo()) doesn't work as a way to require the compiler to actually compute foo() in any of gcc / clang / MSVC / ICC, with optimization enabled.
#include <bitset>
void foo() {
    for (int i = 0; i < 10000; ++i)
        for (int j = 0; j < 10000; ++j) {
            std::bitset<64> std_set(i + j);
            //volatile const auto c = std_set.count(); // real work happens
            static_cast<volatile void>(std_set.count()); // optimizes away
        }
}
compiles to just a ret with all 4 major x86 compilers. (MSVC emits asm for stand-alone definitions of std::bitset::count() or something, but scroll down for its trivial definition of foo().)
(Source + asm output for this and the next example on Matt Godbolt's compiler explorer)
Maybe there are some compilers where static_cast<volatile void>() does do something, in which case it could be a lighter-weight way to write a repeat-loop that doesn't spend instructions storing the result to memory, only computing it. (This may sometimes be what you want in a microbenchmark.)
Accumulating the result with tmp += foo() (or tmp |=) and returning it from main() or printing it with printf can also be useful, instead of storing into a volatile variable. Or use various compiler-specific things, like an empty inline asm statement to break the compiler's ability to optimize without actually adding any instructions.
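A sketch of that accumulator approach (the function name is mine, and the bitset popcount just stands in for the code under test):

```cpp
#include <bitset>

// Each iteration's result feeds into tmp, and tmp is eventually printed
// (or returned from main), so the compiler has to do all the work;
// no volatile stores needed.
static unsigned long long benchmark_sum() {
    unsigned long long tmp = 0;
    for (int i = 0; i < 10000; ++i)
        tmp += std::bitset<64>(i).count(); // tmp |= ... also works
    return tmp;
}
```

Then make the total observable, e.g. std::printf("checksum: %llu\n", benchmark_sum()); in main, so the whole dependency chain stays live.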
See Chandler Carruth's CppCon 2015 talk on using perf to investigate compiler optimizations, where he shows an optimizer-escape function for GNU C. But his escape() function is written to require the value to be in memory (passing the asm a void* to it, with a "memory" clobber). We don't need that; we just need the compiler to have the value in a register or memory, or even an immediate constant. (It's unlikely to fully unroll our loop, because it doesn't know that the asm statement is zero instructions.)
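For reference, the helpers from that talk look roughly like this (GNU-style inline asm; this is my reconstruction, so treat the exact form as approximate):

```cpp
#include <bitset>

// escape(): the empty asm claims to use the pointer and, via the
// "memory" clobber, to potentially read/write what it points to,
// so the object must actually exist in memory at this point.
static void escape(void *p) {
    asm volatile("" : : "g"(p) : "memory");
}

// clobber(): tells the compiler all memory may be read or written,
// forcing pending stores to happen before this point.
static void clobber() {
    asm volatile("" : : : "memory");
}

void test_escape() {
    for (int i = 0; i < 10000; ++i) {
        int count = std::bitset<64>(i).count();
        escape(&count); // count is forced into memory every iteration
    }
}
```

That memory round-trip is exactly the overhead the register-friendly "g" constraint version below avoids.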
The code below compiles to just the popcnt without any extra stores, on gcc.
// Just force the value to be in memory, a register, or even an immediate.
// Instead of empty inline asm, use the operand in a comment so we can see
// what the compiler chose. Absolutely no effect on optimization.
static void escape_integer(int a) {
    asm volatile("# value = %0" : : "g"(a));
}

// simplified with just one inner loop
void test1() {
    for (int i = 0; i < 10000; ++i) {
        std::bitset<64> std_set(i);
        int count = std_set.count();
        escape_integer(count);
    }
}
# gcc 8.0 20171110 nightly -O3 -march=nehalem (for the popcnt instruction):
test1():
        # value = 0         # it peels the first iteration, with an immediate 0 for the inline asm
        mov     eax, 1
.L4:
        popcnt  rdx, rax
        # value = edx       # the inline-asm comment has the %0 filled in to show where gcc put the value
        add     rax, 1
        cmp     rax, 10000
        jne     .L4
        ret
Clang chooses to put the value in memory to satisfy the "g" constraint, which is pretty dumb. But clang does tend to do that when you give it an inline-asm constraint that includes memory as an option. So it's no better than Chandler's escape function for this.
# clang 5.0 -O3 -march=nehalem
test1():
        xor     eax, eax
        #DEBUG_VALUE: i <- 0
.LBB1_1:                        # =>This Inner Loop Header: Depth=1
        popcnt  rcx, rax
        mov     dword ptr [rsp - 4], ecx
        # value = -4(%rsp)      # inline asm gets a value in memory
        inc     rax
        cmp     rax, 10000
        jne     .LBB1_1
        ret
ICC18 with -march=haswell does this:
test1():
        xor     eax, eax        #30.16
..B2.2:                         # Preds ..B2.2 ..B2.1
        # optimization report
        # %s was not vectorized: ASM code cannot be vectorized
        xor     rdx, rdx        # breaks popcnt's false dep on the destination
        popcnt  rdx, rax        #475.16
        inc     rax             #30.34
        # value = edx
        cmp     rax, 10000      #30.25
        jl      ..B2.2          # Prob 99%  #30.25
        ret                     #35.1
That's weird: ICC used xor rdx,rdx instead of xor edx,edx. That wastes a REX prefix, and 64-bit xor-zeroing isn't recognized as dependency-breaking on Silvermont/KNL.