Detecting matching bits in C++

Tags:

c++

I'm trying to take two bitset objects, for example

a = 10010111
b = 01110010

and remove bits from both variables if they match in the same position/index. So we'd be left with

a = 100xx1x1 = 10011
b = 011xx0x0 = 01100

Is there any way to achieve this?


asked Jan 18 '17 13:01

2 Answers

Other answers have shown nice, idiomatic C++ ways of doing this. Unfortunately, they are going to be rather slow. Even AndyG's clever template-based solution, although it does do as much of the work as possible at compile time, still causes the compiler to generate a lot of code that must be executed at runtime.

If you care about speed and are targeting a processor that supports the BMI2 instruction set (which would be Intel Haswell and later, or AMD Excavator and later), then you can use the PEXT instruction, which performs a parallel bit extraction. This allows you to literally solve the entire problem in about two machine instructions.

Since you're not writing in assembly, you would use the corresponding intrinsic for the PEXT instruction, which is _pext_u32. In its basic form, the code is simple, readable, and extremely efficient:

#include <stdint.h>      // for uint32_t
#include <x86intrin.h>   // for _pext_u32()  [on MSVC, drop the 'x86']

void RemoveMatchingBits(uint32_t& a, uint32_t& b)
{
   const uint32_t mask = (a ^ b);
   a = _pext_u32(a, mask);
   b = _pext_u32(b, mask);
}

First, you bitwise-XOR the two values (a and b) together. This generates a mask in which each bit is set if the corresponding bits of a and b differ (that is, the bit is set in exactly one of them) and clear if they match. This mask is then used as the basis for the bit extraction performed by _pext_u32, which gathers the bits of its first operand at the positions selected by the mask and packs them contiguously into the low-order bits of the result. The same mask is used for both bit-extraction operations, so only a single XOR instruction is required. Each _pext_u32 intrinsic will compile to a PEXT instruction. So, aside from some MOV instructions to shuffle around values (which will depend on the compiler used to generate the code and whether this code is inlined), there are only three machine-code instructions required. Here's how contemporary versions of GCC and Clang compile the above function (MSVC and ICC emit code that is extremely similar):

RemoveMatchingBits(unsigned int&, unsigned int&):
    mov     eax, DWORD PTR [rdi]    // rdi contains a pointer to 'a'
    mov     edx, DWORD PTR [rsi]    // rsi contains a pointer to 'b'
    xor     edx, eax
    pext    eax, eax, edx
    mov     DWORD PTR [rdi], eax
    mov     eax, DWORD PTR [rsi]
    pext    eax, eax, edx
    mov     DWORD PTR [rsi], eax
    ret

As you can see, most of the extra instructions here are MOVs, mandated by the way that we've written the function to accept its arguments by reference and modify those values in place. Tweaking how the function is written, and/or getting the optimizer to inline it at the call site, will yield an even more efficient implementation.
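As a sanity check of the mask logic, here is a small worked example using the values from the question. It is only a sketch: extract_bits_reference and to_bits are hypothetical helper names that are not part of the code above, and the helper walks the bits one at a time in portable C++ rather than using PEXT, so it shows what the result should be, not how to get it quickly.

```cpp
#include <bitset>
#include <cstdint>
#include <string>

// Hypothetical reference helper (illustration only): gathers the bits of
// 'value' at the positions where 'mask' is set and packs them into the
// low-order bits of the result, one bit at a time.
std::uint32_t extract_bits_reference(std::uint32_t value, std::uint32_t mask)
{
    std::uint32_t result = 0;
    int out = 0;                                   // next free position in the result
    for (int i = 0; i < 32; ++i)
    {
        if ((mask >> i) & 1u)                      // is bit i selected by the mask?
        {
            result |= ((value >> i) & 1u) << out;  // copy bit i of 'value' to position 'out'
            ++out;
        }
    }
    return result;
}

// Hypothetical formatting helper: the low 'width' bits of v as text,
// most significant bit first.
std::string to_bits(std::uint32_t v, int width)
{
    return std::bitset<32>(v).to_string().substr(32 - width);
}
```

With a = 0x97 (10010111) and b = 0x72 (01110010), the mask a ^ b is 0xE5 (11100101). Extracting through that mask leaves 10011 for a and 01100 for b, which matches the result asked for in the question.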

If you want to use a std::bitset, just modify the code slightly. The to_ulong() member function allows you to access the raw bits for manipulation. Something like:

#include <bitset>        // for std::bitset

void RemoveMatchingBits(std::bitset<8>& a, std::bitset<8>& b)
{
   const std::bitset<8> mask = (a ^ b);
   a = _pext_u32(static_cast<uint32_t>(a.to_ulong()), static_cast<uint32_t>(mask.to_ulong()));
   b = _pext_u32(static_cast<uint32_t>(b.to_ulong()), static_cast<uint32_t>(mask.to_ulong()));
}

Note that this further decreases the efficiency of the generated code, given the need to deal with the std::bitset object. In particular, the to_ulong() member function has to detect and throw an exception in the case of overflow, and MSVC seems incapable of optimizing that check out, even though a std::bitset<8> cannot possibly overflow a 32-bit integer type. Oh well—the code will be fast enough, and no one said abstractions were completely free.

If you cannot compile assuming BMI2 support, you can check at runtime using the CPUID instruction (virtually all x86 compilers provide an intrinsic for this).
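A minimal sketch of such a check might look like the following. The function name cpu_has_bmi2 is hypothetical. The sketch assumes either the MSVC __cpuid/__cpuidex intrinsics or the GCC/Clang builtin __builtin_cpu_supports, and relies on BMI2 being reported in CPUID leaf 7, sub-leaf 0, EBX bit 8. On anything that is not x86 it simply reports false.

```cpp
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>   // __cpuid, __cpuidex
#endif

// Hypothetical helper: returns true if the running CPU advertises BMI2.
bool cpu_has_bmi2()
{
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4];                         // EAX, EBX, ECX, EDX
    __cpuid(regs, 0);                    // leaf 0: EAX = highest supported leaf
    if (regs[0] < 7)
        return false;                    // leaf 7 does not exist on this CPU
    __cpuidex(regs, 7, 0);               // leaf 7, sub-leaf 0: extended features
    return (regs[1] & (1 << 8)) != 0;    // EBX bit 8 = BMI2
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("bmi2") != 0;
#else
    return false;                        // not x86: PEXT is never available
#endif
}
```

You would typically call this once at startup, cache the answer, and use it to choose between the PEXT-based code path and a portable one.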

If BMI2 is not available, if you are not targeting x86, or if you just don't want to worry about the complexity of run-time delegation, then you can fall back to an alternative bit-twiddling implementation. Specifically, what you want is a "compress" operation. Discussion and code for this is given in section 7–4 of Henry S. Warren, Jr.'s classic book, Hacker's Delight.

Here is a straightforward, loop-based implementation of "compress", adapted from Figure 7–9 in Hacker's Delight:

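uint32_t compress(uint32_t value, uint32_t mask)
{
   uint32_t result = 0;
   uint32_t shift  = 0;                        // position of the next output bit
   uint32_t maskBit;
   do
   {
      maskBit = (mask & 1);                    // is the current bit selected?
      result |= ((value & maskBit) << shift);  // if so, append it to the result
      shift  += maskBit;                       // advance the output only for selected bits
      value >>= 1;
      mask  >>= 1;
   } while (mask != 0);
   return result;
}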

This adequately simulates the PEXT instruction, but it isn't fast, because it needs one loop iteration per mask bit. The following code computes the same result, but uses a faster "parallel suffix" method based on Figure 7–10 in Hacker's Delight:

uint32_t fallback_pext_u32(uint32_t value, uint32_t mask)
{
   const int log2BitSize = 5;                     // log_2 of the bit size (here, 32 bits)

   value &= mask;                                 // clear irrelevant bits    
   uint32_t mk = (~mask << 1);                    // we will count 0's to the right
   uint32_t mp;
   uint32_t mv;
   uint32_t t;
   for (int i = 0; i < log2BitSize; ++i)
   {
      mp     = mk ^ (mk <<  1);                   // parallel suffix
      mp     = mp ^ (mp <<  2);
      mp     = mp ^ (mp <<  4);
      mp     = mp ^ (mp <<  8);
      mp     = mp ^ (mp << 16);
      mv     = (mp & mask);                       // bits to move
      mask   = ((mask ^ mv) | (mv >> (1 << i)));  // compress mask
      t      = (value & mv);
      value  = ((value ^ t) | (t >> (1 << i)));   // compress value
      mk    &= ~mp;
   }
   return value;
}

This fallback implementation will be slower than a single PEXT instruction, but it has no data-dependent branches (the loop always runs a fixed five iterations, which the compiler is free to unroll), so there won't be any hidden penalties for mispredicted branches when dealing with random input. You should get close to the maximum possible throughput from your CPU here, and either way, it will certainly be much faster than a for loop with a series of conditional branches, as proposed by the other answers.
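The log2BitSize constant hints at how the parallel-suffix routine generalizes to wider integers. As a sketch only, here is what a 64-bit variant might look like. The name fallback_pext_u64 is hypothetical; the changes relative to the 32-bit code are the wider type, a log2BitSize of 6, and one extra parallel-suffix step that shifts by 32.

```cpp
#include <cstdint>

// Hypothetical 64-bit variant of the parallel-suffix fallback (illustration only).
std::uint64_t fallback_pext_u64(std::uint64_t value, std::uint64_t mask)
{
    const int log2BitSize = 6;                  // log_2 of the bit size (here, 64 bits)

    value &= mask;                              // clear irrelevant bits
    std::uint64_t mk = (~mask << 1);            // we will count 0's to the right
    for (int i = 0; i < log2BitSize; ++i)
    {
        std::uint64_t mp = mk ^ (mk << 1);      // parallel suffix
        mp ^= (mp << 2);
        mp ^= (mp << 4);
        mp ^= (mp << 8);
        mp ^= (mp << 16);
        mp ^= (mp << 32);                       // extra step needed for 64 bits
        const std::uint64_t mv = (mp & mask);   // bits to move
        mask  = ((mask ^ mv) | (mv >> (1 << i)));    // compress mask
        const std::uint64_t t = (value & mv);
        value = ((value ^ t) | (t >> (1 << i)));     // compress value
        mk &= ~mp;
    }
    return value;
}
```

For the values from the question, fallback_pext_u64(0x97, 0xE5) yields 0x13 (binary 10011).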

answered Oct 04 '22 14:10

Cody Gray

You could use boost::dynamic_bitset<> for the result and build it up one bit at a time with push_back.

#include <iostream>
#include <boost/dynamic_bitset.hpp>
#include <bitset>

int main()
{
    const int N = 8;
    boost::dynamic_bitset<> a_out(0);
    boost::dynamic_bitset<> b_out(0); 
    std::bitset<N> a(0x97); // 10010111
    std::bitset<N> b(0x72); // 01110010

    for (int i = 0; i < N; i++)
    {
        if (a[i] != b[i])
        {
            a_out.push_back(bool(a[i]));
            b_out.push_back(bool(b[i]));
        }
    }


    std::cout << a_out << "\n";
    std::cout << b_out << "\n";

    return 0;
}


Output:
10011
01100

[EDITED] And if you want to optimize, you can add this before the for loop (but you need Boost 1.62 or newer to use reserve()):

//@5gon12eder Optimization
const auto xorified = a ^ b;
const auto n = xorified.count();
a_out.reserve(n); 
b_out.reserve(n);

And inside the for loop compare bits as:

if (xorified[i]) { ... }
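Putting those pieces together, the whole idea can be sketched as one function. Note the assumptions: remove_matching_bits is a hypothetical name, and the sketch swaps boost::dynamic_bitset<> for std::string so that it compiles without Boost. The surviving bits are returned as text, most significant bit first, which is the same order in which a dynamic_bitset is printed.

```cpp
#include <bitset>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper (illustration only): returns the bits of a and b that
// remain after dropping every position where the two inputs agree.
template <std::size_t N>
std::pair<std::string, std::string>
remove_matching_bits(const std::bitset<N>& a, const std::bitset<N>& b)
{
    const auto xorified = a ^ b;                // set where a and b differ
    const std::size_t n = xorified.count();     // number of surviving bits

    std::string a_out, b_out;
    a_out.reserve(n);                           // allocate once, up front
    b_out.reserve(n);

    for (std::size_t i = N; i-- > 0; )          // walk from the MSB down to bit 0
    {
        if (xorified[i])
        {
            a_out.push_back(a[i] ? '1' : '0');
            b_out.push_back(b[i] ? '1' : '0');
        }
    }
    return {a_out, b_out};
}
```

Called with std::bitset<8>(0x97) and std::bitset<8>(0x72), this returns the pair "10011" and "01100".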

answered Oct 04 '22 14:10

Rama

eddiewastaken