
std::bitset and SSE instructions

Tags:

c++

sse

bitset

Is it possible to use SSE instructions on the underlying data of a std::bitset? The bitsets I am working with are larger than unsigned long, so the to_ulong() method will not suffice. For instance, can I use an instruction like this:

__m128i* ptr= (__m128i*)(&my_bitset[0]);

Then perform SSE operations per normal?

I have tried to search the internet quite a bit for people using std::bitset with SSE and it doesn't appear to be a common use case.

asked Jun 07 '13 by kip622

3 Answers

Is it possible to use SSE instructions on the underlying data of a std::bitset?

In

__m128i* ptr= (__m128i*)(&my_bitset[0]);

my_bitset[0] returns a temporary proxy object of unspecified layout, which contains a pointer to the container/storage and the bit index (see e.g. the GNU C++ std::bitset::reference implementation). Casting a pointer to this temporary proxy object to __m128i* would be meaningless. But C++ doesn't allow taking the address of a temporary object at all, so &my_bitset[0] results in a compiler error.
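To see the proxy behaviour concretely, here is a minimal sketch (the commented-out line is the one that fails to compile; the helper name is mine, for illustration):

```cpp
#include <bitset>

// Demonstrates that operator[] returns a proxy object, not a bool&.
std::bitset<256> set_first_bit() {
    std::bitset<256> my_bitset;
    auto proxy = my_bitset[0];  // std::bitset<256>::reference, a proxy, not bool&
    proxy = true;               // writes through the proxy into the bitset's storage
    // __m128i* p = (__m128i*)(&my_bitset[0]);  // ill-formed: address of a temporary
    return my_bitset;
}
```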


std::bitset may use SIMD instructions for its member functions automatically if/when the compiler chooses to vectorize it.

In this example, gcc decided to use 256-bit AVX instructions, whereas clang decided not to. Neither choice is ideal:

  • gcc generated AVX instructions with 256-bit ymm registers, which reduce CPU frequency on older Intel CPUs (or crash overclocked ones with a forced AVX offset of 0). But the vector is too small to justify paying the price of increased CPU power consumption and possibly lower frequency for using sporadic ymm-register instructions here and there.

  • clang generated instructions on 64-bit general-purpose registers, which take more instruction bytes and more loads/stores than SSE with 128-bit xmm registers would. CPUs can only perform a fixed number of load/store instructions (not bytes) per cycle, so it makes sense to maximize the amount of data loaded and stored per instruction.

The ideal choice in this example may be to use SSE instructions with 128-bit xmm registers: minimize the number of load/store instructions without downclocking the CPU. Which goes to show that compiler auto-vectorization is often not ideal.
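A whole-bitset AND like the one below is exactly the kind of loop the compiler may or may not vectorize; compiling it with different compilers and inspecting the asm shows which registers they pick (the function name is mine, for illustration):

```cpp
#include <bitset>

// operator& loops over the underlying words; compilers may auto-vectorize it.
std::bitset<512> and_bitsets(const std::bitset<512>& a, const std::bitset<512>& b) {
    return a & b;
}
```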


std::bitset, unfortunately, doesn't provide direct access to its storage, and any access to it by a C-style cast may result in undefined behavior without a warning or error due to layout, alignment or strict aliasing violation.

std::bitset is unlikely to use any non-standard/SIMD type for its storage because of portability constraints, so casting its storage to a wider SIMD type pretty much guarantees an alignment and strict-aliasing violation. There are non-portable ways to work around that, but they are brittle against future changes, which is why I cannot recommend going this way.


You may like to look for other containers designed with SIMD in mind, such as Vc: portable, zero-overhead C++ types for explicitly data-parallel programming. It allows choosing the SIMD instruction type on a per-container-class basis, e.g. you may want to use only 128-bit xmm-register instructions for this particular container type, even if 256-bit ymm registers are available.


gcc and clang both support Using Vector Instructions through Built-in Functions on types declared with __attribute__((vector_size (N))), which is another way:

Currently, GCC allows using the following operators on these types: +, -, *, /, unary minus, ^, |, &, ~, %.

But these don't allow choosing the underlying SIMD type/instructions on per-container-class basis, only per object file with compiler options like -mno-avx.
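A minimal sketch of that vector extension (GCC/Clang only, not MSVC; the typedef and function name are mine, for illustration):

```cpp
// GCC/Clang vector extension: a 16-byte vector of two 64-bit lanes.
typedef long long v2di __attribute__((vector_size(16)));

// The & operator applies lane-wise; on x86 this typically compiles to one pand/vpand.
v2di vec_and(v2di a, v2di b) {
    return a & b;
}
```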

answered Oct 05 '22 by Maxim Egorushkin

bitset does not have a standard way to access its internal data.

There's the itsy_bitsy library, which provides a bitset-like interface over other data. bit_view is what you need: it wraps existing storage with the ability to manipulate bits, but without insert/erase operations.

Not sure if you can have bitsy::bit_view directly on the __m128i type, but it supports things like bitsy::bit_view<std::span<char>>, so you can have __m128i variable(s) and reinterpret them as a span of chars.
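The span-of-chars idea works because char access may alias any object. A rough sketch of the same idea without itsy_bitsy itself (the helper name is mine, for illustration):

```cpp
#include <immintrin.h>

// Set bit `pos` inside a __m128i by viewing its memory as bytes
// (little-endian bit order, matching how x86 stores the vector in memory).
void set_bit(__m128i* v, unsigned pos) {
    unsigned char* bytes = (unsigned char*)v;  // char* may alias anything
    bytes[pos / 8] |= (unsigned char)(1u << (pos % 8));
}
```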

answered Oct 05 '22 by Alex Guteniev


You can just use SIMD on the whole bitset object, if you know the object layout of your standard library.

Most implementations of std::bitset<> make the obvious implementation choice that the object-representation of the whole bitset object is just the bits, packed into contiguous bytes. (I'd be surprised if any mainstream real-world implementation wasn't like that, but there's no guarantee you can safely assume that.) Most of those choose to use an array of some integer type wider than a byte.

If we're talking about just the x86 compilers that implement Intel's intrinsics API, that's a smaller set of implementations.

In GNU libstdc++ at least, the lowest-address chunk holds bits 0..63, and so on. (So it's little-endian across chunks, and x86 is little-endian for the bytes within chunks.) And bitset[0] is the low bit of the low byte of the low word, i.e. a load and `and eax, 1`. It's possible that implementations might make different choices, like storing bitset[0] at the bottom of the highest-address chunk, big-endian style. That wouldn't line up with how the x86 bt / bts bit-string instructions index memory, but those are slow anyway, so the main reason for not doing so is that it would be more work to turn a runtime-variable index into an address and bitmask or shift count.
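A quick sketch to check that layout, valid on mainstream implementations where the object representation is just the packed words (the helper name is mine, for illustration):

```cpp
#include <bitset>
#include <cstdint>
#include <cstring>

// Copy the bitset's object representation into plain words for inspection.
// Assumes sizeof(std::bitset<128>) == 16, which holds on mainstream implementations.
void dump_words(const std::bitset<128>& bs, std::uint64_t words[2]) {
    std::memcpy(words, &bs, 2 * sizeof(std::uint64_t));
}
```

With the libstdc++-style layout, setting bits 0 and 64 should put a 1 in the low bit of each 64-bit word.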

If you want to try to non-portably take advantage of this, use _mm_loadu_si128 on the std::bitset object itself, not on a bit-iterator that &bitset[0] returns.

#include <bitset>
#include <immintrin.h>

// being a struct or class member isn't necessary, just a handy place to put an alignas()
// for example purposes.
struct foo {
 alignas(32) std::bitset<384> bs;  // 32-byte aligned, 48 bytes total.
           // alignas(16) would be sufficient for what I'm doing with SSE2
 int x, y;                 // with or without these, the struct size is a multiple of the alignment, thus 64B.
};
  // beware that allocating this with  new  might not respect alignment before C++17


void bar(foo *pfoo)
{
    char *bsp = (char*) &(pfoo->bs);   // pointer to (the first byte of) the bitset
      // as a char* so pointer math works in bytes.
      // unfortunately load/store intrinsics require casting back to __m128i*
      // until AVX-512 when Intel realized void* would be better.

    __m128i v0 = _mm_load_si128( (__m128i*)bsp );       // aligned load of bits 0..127
    __m128i v1 = _mm_loadu_si128( (__m128i*)(bsp+3) );  // unaligned load of bits 24..151
    v0 = _mm_and_si128(v0, v1);
    _mm_store_si128( (__m128i*)(bsp+16), v0 );          // aligned store at another alignment boundary
}

This compiles (with GCC11.2 on Godbolt) to the following asm:

bar(foo*):
        movdqu  xmm0, XMMWORD PTR [rdi+3]    # unaligned load has to use movdqu
        pand    xmm0, XMMWORD PTR [rdi]      # aligned load can fold into a memory operand even without AVX
        movaps  XMMWORD PTR [rdi+16], xmm0   # aligned store.  (movaps is shorter than movdqa)
        ret

With AVX, the compiler could have chosen to do a vmovdqa load for v0 and use an unaligned memory source operand for vpand xmm0, xmm0, [rdi+3], but I compiled without -march=haswell to demo the SSE advantage of being able to use aligned load intrinsics. (See also Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd? re: tuning options in older GCC.)

You can even alignas(32) std::bitset<256> bs to align that instance of the bitset by 32 bytes, allowing use of aligned load/store like _mm256_load_si256 instead of loadu. There could still be other objects in part of the last 32 bytes if your bitset isn't a multiple of 256 bits, so don't assume it's just alignment padding you can step on. It wouldn't be thread-safe to do a non-atomic load/store of those bytes (e.g. if you're modifying the bits that are part of the bitset and storing back the later bytes unchanged).

Beware that allocating objects with more alignment than alignof(max_align_t) (typically 16 in x86-64 implementations) is only well-supported with new in C++17. Before that, alignas() only Just Worked for static and automatic storage.

Reminder: nothing guarantees this is portable

But it will probably work, on a C++ implementation that isn't a DeathStation 9000.

If you can't / don't want to hand-roll your own bitmap, or don't want to use Alex's suggestion of itsy_bitsy which has a documented way to get at the data, then this hack might be worth it if you can't get your compiler to make efficient asm in a more portable way.

As long as your C++ library implements bitset with something like class bitset { private: _chunk_t _data[size]; };, there's nothing undefined about messing with the object representation via intrinsics. (GNU libstdc++ uses _WordT _M_w[_Nw];)

Intrinsics are defined to safely alias any other data, just like char*. GCC/clang implement this by defining them as may_alias types. See Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
(This does bypass the normal public / private restrictions, though.)

If this somehow breaks with some future compiler version, that's your problem. I think it's unlikely that something would change normal std::bitset implementations to not have their object representation just be the array of bits, though.

You can look at the asm for something like return pfoo->bs.to_ulong() and see what it loads to check for set high bits (unfortunately not vectorizing the test), before loading the low chunk. That confirms the bits are where we expected. (See the Godbolt link).

If you do this, write a unit test that uses _mm_set_epi32(1,0,0,0) or something and store that to the bitset, then make sure the one set bit is where you expect it to be, at bs[96]. That way you'll detect if the implementation changes the layout of std::bitset<>.
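Such a test might look like this. I use memcpy for the store so the sketch stays within defined behaviour; bit 96 is the low bit of the highest 32-bit element of _mm_set_epi32(1,0,0,0), and the helper name is mine:

```cpp
#include <bitset>
#include <cstring>
#include <immintrin.h>

// Store a vector with only bit 96 set into the bitset's object representation,
// then check the bit lands where the little-endian layout predicts.
bool layout_matches() {
    std::bitset<128> bs;
    __m128i v = _mm_set_epi32(1, 0, 0, 0);  // element 3 = bits 96..127
    std::memcpy(&bs, &v, sizeof(v));        // overwrite the object representation
    return bs.test(96) && bs.count() == 1;
}
```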

You could also use a static_assert on the size. For a size like 256 bits, sizeof() will be a constant 32 even across implementations that use char bits[32] or uint64_t bigchunks[4]. sizeof(std::bitset<129>) could vary, though. But static_assert won't catch differences in the order of the words or bits within a word.
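For example, a size check like this catches an implementation that pads the object, though not one that reorders the words:

```cpp
#include <bitset>

// 256 bits pack into 32 bytes whether the chunks are char[32] or uint64_t[4].
static_assert(sizeof(std::bitset<256>) == 32,
              "std::bitset<256> has an unexpected object size");
```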

If you can use C++23, the unit test for bit order can also be put in a static_assert: std::bit_cast is available from C++20, and std::bitset's member functions became fully constexpr in C++23. Though in this case the unit test wouldn't be able to use SSE intrinsics and would have to use plain C++ operations. You could use char* operations to manipulate the object representation of a std::bitset the same way you would with intrinsics. Or even better, use std::bit_cast<>, which shouldn't compile for types with a vtable or something, at least in a constexpr context. For example, Alex suggested https://godbolt.org/z/1advToGf5 in comments.

The very fact that std::bitset operations are required to be constexpr probably rules out some insane implementation choices entirely.

answered Oct 05 '22 by Peter Cordes