When people try to perform rigorous benchmarks of various libraries, I sometimes see code like this:
auto std_start = std::chrono::steady_clock::now();
for (int i = 0; i < 10000; ++i)
    for (int j = 0; j < 10000; ++j)
        volatile const auto __attribute__((unused)) c = std_set.count(i + j);
auto std_stop = std::chrono::steady_clock::now();
The volatile is used here to prevent the optimizer from noticing that the result of the code under test is discarded, and then discarding the entire computation.
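A self-contained version of that pattern might look like the sketch below. The element type of std_set (a std::set<int>) and the shrunken trip counts are my assumptions; the question doesn't show them:

```cpp
#include <chrono>
#include <set>

// Sketch of the volatile-sink timing pattern. std_set as a std::set<int>
// is an assumption; trip counts are reduced so this runs quickly.
double time_count_ms() {
    std::set<int> std_set = {1, 2, 3, 5, 8, 13, 21, 34};
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000; ++i)
        for (int j = 0; j < 1000; ++j) {
            // The volatile store is the sink: each count() result must be
            // materialized, so the call can't be optimized away.
            volatile auto c = std_set.count(i + j);
            (void)c; // silence "unused variable" without __attribute__
        }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

The `(void)c` cast is just a portable way to suppress the unused-variable warning, in place of `__attribute__((unused))`.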
When the code under test doesn't return a value, say it is void do_something(int), then sometimes I see code like this:
auto std_start = std::chrono::steady_clock::now();
for (int i = 0; i < 10000; ++i)
    for (int j = 0; j < 10000; ++j)
        static_cast<volatile void>(do_something(i + j));
auto std_stop = std::chrono::steady_clock::now();
Is this correct usage of volatile? What is volatile void? What does it mean from the point of view of the compiler and the standard?
In the standard (N4296) at [dcl.type.cv] it says:
7 [ Note: volatile is a hint to the implementation to avoid aggressive optimization involving the object because the value of the object might be changed by means undetectable by an implementation. Furthermore, for some implementations, volatile might indicate that special hardware instructions are required to access the object. See 1.9 for detailed semantics. In general, the semantics of volatile are intended to be the same in C++ as they are in C. — end note ]
Section 1.9 gives a lot of guidance about the execution model, but as far as volatile is concerned, it's all about "accessing a volatile object". It's not clear to me what executing a statement that has been cast to volatile void means, assuming I understand the code correctly, and exactly what optimization barrier, if any, is produced.
static_cast<volatile void>(foo()) doesn't work as a way to require the compiler to actually compute foo() in any of gcc / clang / MSVC / ICC, with optimization enabled.
#include <bitset>
void foo() {
    for (int i = 0; i < 10000; ++i)
        for (int j = 0; j < 10000; ++j) {
            std::bitset<64> std_set(i + j);
            //volatile const auto c = std_set.count(); // real work happens
            static_cast<volatile void>(std_set.count()); // optimizes away
        }
}
compiles to just a ret with all 4 major x86 compilers. (MSVC emits asm for stand-alone definitions of std::bitset::count() or something, but scroll down for its trivial definition of foo().)
(Source + asm output for this and the next example on Matt Godbolt's compiler explorer)
Maybe there are some compilers where static_cast<volatile void>() does do something, in which case it could be a lighter-weight way to write a repeat-loop that doesn't spend instructions storing the result to memory, only computing it. (This may sometimes be what you want in a microbenchmark.)
Accumulating the result with tmp += foo() (or tmp |=) and returning it from main() or printing it with printf can also be useful, instead of storing into a volatile variable. Or use various compiler-specific things, like an empty inline asm statement to break the compiler's ability to optimize without actually adding any instructions.
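A sketch of that accumulator approach (the function name is mine, and the bitset popcount just stands in for the code under test):

```cpp
#include <bitset>

// Each iteration's result feeds into tmp, and tmp is eventually printed
// (or returned from main), so the compiler has to do all the work;
// no volatile stores needed.
static unsigned long long benchmark_sum() {
    unsigned long long tmp = 0;
    for (int i = 0; i < 10000; ++i)
        tmp += std::bitset<64>(i).count(); // tmp |= ... also works
    return tmp;
}
```

Then make the total observable, e.g. std::printf("checksum: %llu\n", benchmark_sum()); in main, so the whole dependency chain stays live.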
See Chandler Carruth's CppCon 2015 talk on using perf to investigate compiler optimizations, where he shows an optimizer-escape function for GNU C. But his escape() function is written to require the value to be in memory (passing the asm a void* to it, with a "memory" clobber). We don't need that; we just need the compiler to have the value in a register or memory, or even an immediate constant. (It's unlikely to fully unroll our loop, because it doesn't know that the asm statement is zero instructions.)
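For reference, the helpers from that talk look roughly like this (GNU-style inline asm; this is my reconstruction, so treat the exact form as approximate):

```cpp
#include <bitset>

// escape(): the empty asm claims to use the pointer and, via the
// "memory" clobber, to potentially read/write what it points to,
// so the object must actually exist in memory at this point.
static void escape(void *p) {
    asm volatile("" : : "g"(p) : "memory");
}

// clobber(): tells the compiler all memory may be read or written,
// forcing pending stores to happen before this point.
static void clobber() {
    asm volatile("" : : : "memory");
}

void test_escape() {
    for (int i = 0; i < 10000; ++i) {
        int count = std::bitset<64>(i).count();
        escape(&count); // count is forced into memory every iteration
    }
}
```

That memory round-trip is exactly the overhead the register-friendly "g" constraint version below avoids.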
The code below compiles to just the popcnt without any extra stores, on gcc.
// Just force the value to be in memory, a register, or even an immediate.
// Instead of empty inline asm, use the operand in a comment so we can see
// what the compiler chose. Absolutely no effect on optimization.
static void escape_integer(int a) {
    asm volatile("# value = %0" : : "g"(a));
}

// simplified with just one inner loop
void test1() {
    for (int i = 0; i < 10000; ++i) {
        std::bitset<64> std_set(i);
        int count = std_set.count();
        escape_integer(count);
    }
}
# gcc 8.0 20171110 nightly -O3 -march=nehalem (for the popcnt instruction):
test1():
        # value = 0         # it peels the first iteration, with an immediate 0 for the inline asm
        mov     eax, 1
.L4:
        popcnt  rdx, rax
        # value = edx       # the inline-asm comment has the %0 filled in to show where gcc put the value
        add     rax, 1
        cmp     rax, 10000
        jne     .L4
        ret
Clang chooses to put the value in memory to satisfy the "g" constraint, which is pretty dumb. But clang does tend to do that when you give it an inline-asm constraint that includes memory as an option. So it's no better than Chandler's escape function for this.
# clang 5.0 -O3 -march=nehalem
test1():
        xor     eax, eax
        #DEBUG_VALUE: i <- 0
.LBB1_1:                        # =>This Inner Loop Header: Depth=1
        popcnt  rcx, rax
        mov     dword ptr [rsp - 4], ecx
        # value = -4(%rsp)      # inline asm gets a value in memory
        inc     rax
        cmp     rax, 10000
        jne     .LBB1_1
        ret
ICC18 with -march=haswell does this:
test1():
        xor     eax, eax        #30.16
..B2.2:                         # Preds ..B2.2 ..B2.1
        # optimization report
        # %s was not vectorized: ASM code cannot be vectorized
        xor     rdx, rdx        # breaks popcnt's false dep on the destination
        popcnt  rdx, rax        #475.16
        inc     rax             #30.34
        # value = edx
        cmp     rax, 10000      #30.25
        jl      ..B2.2          # Prob 99%  #30.25
        ret                     #35.1
That's weird: ICC used xor rdx,rdx instead of xor edx,edx. That wastes a REX prefix, and 64-bit xor-zeroing isn't recognized as dependency-breaking on Silvermont/KNL.