How do I write a maintainable, fast, compile-time bit-mask in C++?

I have some code that is more or less like this:

#include <bitset>
#include <type_traits> // for std::remove_reference

enum Flags { A = 1, B = 2, C = 3, D = 5,
             E = 8, F = 13, G = 21, H,
             I, J, K, L, M, N, O };

void apply_known_mask(std::bitset<64> &bits) {
    const Flags important_bits[] = { B, D, E, H, K, M, L, O };
    std::remove_reference<decltype(bits)>::type mask{};
    for (const auto& bit : important_bits) {
        mask.set(bit);
    }

    bits &= mask;
}

Clang >= 3.6 does the smart thing and compiles this to a single and instruction (which then gets inlined everywhere else):

apply_known_mask(std::bitset<64ul>&):  # @apply_known_mask(std::bitset<64ul>&)
        and     qword ptr [rdi], 775946532
        ret

But every version of GCC I've tried compiles this to an enormous mess that includes error handling that should be statically DCE'd. In other code, it will even place the important_bits equivalent as data in line with the code!

.LC0:
        .string "bitset::set"
.LC1:
        .string "%s: __position (which is %zu) >= _Nb (which is %zu)"
apply_known_mask(std::bitset<64ul>&):
        sub     rsp, 40
        xor     esi, esi
        mov     ecx, 2
        movabs  rax, 21474836482
        mov     QWORD PTR [rsp], rax
        mov     r8d, 1
        movabs  rax, 94489280520
        mov     QWORD PTR [rsp+8], rax
        movabs  rax, 115964117017
        mov     QWORD PTR [rsp+16], rax
        movabs  rax, 124554051610
        mov     QWORD PTR [rsp+24], rax
        mov     rax, rsp
        jmp     .L2
.L3:
        mov     edx, DWORD PTR [rax]
        mov     rcx, rdx
        cmp     edx, 63
        ja      .L7
.L2:
        mov     rdx, r8
        add     rax, 4
        sal     rdx, cl
        lea     rcx, [rsp+32]
        or      rsi, rdx
        cmp     rax, rcx
        jne     .L3
        and     QWORD PTR [rdi], rsi
        add     rsp, 40
        ret
.L7:
        mov     ecx, 64
        mov     esi, OFFSET FLAT:.LC0
        mov     edi, OFFSET FLAT:.LC1
        xor     eax, eax
        call    std::__throw_out_of_range_fmt(char const*, ...)

How should I write this code so that both compilers can do the right thing? Failing that, how should I write this so that it remains clear, fast, and maintainable?

asked Oct 09 '22 by Alex Reinking


2 Answers

The best version is C++17:

template< unsigned char... indexes >
constexpr unsigned long long mask(){
  return ((1ull<<indexes)|...|0ull);
}

Then

void apply_known_mask(std::bitset<64> &bits) {
  constexpr auto m = mask<B,D,E,H,K,M,L,O>();
  bits &= m;
}
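
As a quick sanity check (my addition, assuming the Flags enum from the question and the mask() above are in scope), the C++17 mask evaluates at compile time to exactly the immediate Clang emitted in the question:

// 775946532 is the constant from Clang's `and qword ptr [rdi], 775946532`
static_assert(mask<B, D, E, H, K, M, L, O>() == 775946532ull,
              "compile-time mask matches the constant in the question");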

Back in C++14, we can use this strange trick:

template< unsigned char... indexes >
constexpr unsigned long long mask(){
  auto r = 0ull;
  using discard_t = int[]; // data never used
  // value never used:
  discard_t discard = {0,(void(
    r |= (1ull << indexes) // side effect, used
  ),0)...};
  (void)discard; // block unused var warnings
  return r;
}

Or, if we are stuck with C++11, we can solve it recursively:

constexpr unsigned long long mask(){
  return 0;
}
template<class...Tail>
constexpr unsigned long long mask(unsigned char b0, Tail...tail){
  return (1ull<<b0) | mask(tail...);
}
template< unsigned char... indexes >
constexpr unsigned long long mask(){
  return mask(indexes...);
}
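
To make the recursion concrete (my own worked expansion, assuming the Flags enum from the question), a two-index call unfolds like this; the zero-argument, non-template mask() overload is what terminates the recursion:

// mask<B, D>()  calls the variadic-template overload, which forwards to
// mask(B, D) == (1ull << B) | mask(D)
//            == (1ull << B) | ((1ull << D) | mask())
//            == (1ull << B) | ((1ull << D) | 0ull)
static_assert(mask<B, D>() == ((1ull << B) | (1ull << D)), "two-index expansion");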

Godbolt with all 3 -- you can switch the CPP_VERSION define and get identical assembly.

In practice I'd use the most modern version I could. C++14 beats C++11 because we avoid recursion, and hence O(n^2) symbol length (which can explode compile time and compiler memory usage); C++17 beats C++14 because the compiler doesn't have to dead-code-eliminate that array, and the array trick is just plain ugly.

Of these, the C++14 version is the most confusing. Here we create an anonymous array of all 0s, constructing our result as a side effect along the way, and then discard the array. The discarded array contains a number of 0s equal to the size of our pack, plus 1 (which we add so we can handle empty packs).


A detailed explanation of what the C++14 version is doing follows. It is a trick/hack, and the fact that you have to resort to it to expand parameter packs efficiently in C++14 is one of the reasons fold expressions were added in C++17.

It is best understood from the inside out:

    r |= (1ull << indexes) // side effect, used

this just updates r with 1<<indexes for a fixed index. indexes is a parameter pack, so we'll have to expand it.

The rest of the work is to provide a parameter pack to expand indexes inside of.

One step out:

(void(
    r |= (1ull << indexes) // side effect, used
  ),0)

here we cast our expression to void, indicating we don't care about its return value (we just want the side effect of setting r -- in C++, expressions like a |= b also return the value they set a to).

Then we use the comma operator , followed by 0 to discard the void "value" and produce the value 0. So this is an expression whose value is 0, and as a side effect of computing that 0 it sets a bit in r.
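
As a tiny aside (my own illustration, not part of the original answer), you can watch this sub-expression behave in isolation:

#include <iostream>

int main() {
    unsigned long long r = 0;
    // The |= runs purely for its side effect; void() discards its result,
    // and the comma operator makes the whole expression evaluate to 0.
    int x = (void(r |= (1ull << 5)), 0);
    std::cout << x << ' ' << r << '\n';  // prints: 0 32
}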

  int discard[] = {0,(void(
    r |= (1ull << indexes) // side effect, used
  ),0)...};

At this point, we expand the parameter pack indexes. So we get:

 {
    0,
    (expression that sets a bit and returns 0),
    (expression that sets a bit and returns 0),
    [...]
    (expression that sets a bit and returns 0),
  }

in the {}. This use of , is not the comma operator, but rather the array element separator. This gives us sizeof...(indexes)+1 zeros, and evaluating those elements sets the bits in r as a side effect. We then use that braced initializer list to construct the array discard.

Next we cast discard to void -- most compilers will warn you if you create a variable and never read it, but no compiler will complain if you cast it to void. It is a way of saying "Yes, I know, I'm not using this", so it suppresses the warning.
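
Putting all of that together (my own worked expansion for illustration; mask_2_5 is just a hypothetical name, not from the answer), a call like mask<2, 5>() in the C++14 version behaves as if you had written:

constexpr unsigned long long mask_2_5() {
  auto r = 0ull;
  int discard[] = {
    0,                              // the leading 0 handles the empty-pack case
    (void(r |= (1ull << 2)), 0),    // sets bit 2; the element's value is 0
    (void(r |= (1ull << 5)), 0),    // sets bit 5; the element's value is 0
  };
  (void)discard;                    // silence the unused-variable warning
  return r;                         // 36, i.e. bits 2 and 5 set
}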

answered Oct 12 '22 by Yakk - Adam Nevraumont


The optimization you're looking for seems to be loop peeling, which is enabled at -O3, or manually with -fpeel-loops. I'm not sure why this falls under the purview of loop peeling rather than loop unrolling, but possibly it's unwilling to unroll a loop with nonlocal control flow inside it (as there is, potentially, from the range check).

By default, though, GCC stops short of peeling all the iterations, which is apparently what's needed here. Experimentally, passing -O2 -fpeel-loops --param max-peeled-insns=200 (the default value is 100) gets the job done with your original code: https://godbolt.org/z/NNWrga

answered Oct 12 '22 by Sneftel