Efficient type punning without undefined behavior [duplicate]

Say I'm working on a library called libModern. This library uses a legacy C library, called libLegacy, as an implementation strategy. libLegacy's interface looks like this:

typedef uint32_t LegacyFlags;

struct LegacyFoo {
    uint32_t x;
    uint32_t y;
    LegacyFlags flags;
    // more data
};

struct LegacyBar {
    LegacyFoo foo;
    float a;
    // more data
};

void legacy_input(LegacyBar const* s); // Does something with s
void legacy_output(LegacyBar* s); // Stores data in s

libModern shouldn't expose libLegacy's types to its users for various reasons, among them:

  • libLegacy is an implementation detail that shouldn't be leaked. Future versions of libModern might choose to use another library instead of libLegacy.
  • libLegacy uses hard-to-use, easy-to-misuse types that shouldn't be part of any user-facing API.

The textbook way to deal with this situation is the pimpl idiom: libModern would provide a wrapper type that internally has a pointer to the legacy data. However, this is not possible here, since libModern cannot allocate dynamic memory. Generally, its goal is not to add a lot of overhead.

Therefore, libModern defines its own types that are layout-compatible with the legacy types, yet have a better interface. In this example it is using a strong enum instead of a plain uint32_t for flags:

enum class ModernFlags : std::uint32_t
{
    first_flag = 0,
    second_flag = 1,
};

struct ModernFoo {
    std::uint32_t x;
    std::uint32_t y;
    ModernFlags flags;
    // More data
};

struct ModernBar {
    ModernFoo foo;
    float a;
    // more data
};
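The assumption that the two sets of types line up in memory can be checked at compile time. The following is only a minimal sketch: it repeats the definitions from above without the elided "more data" members, and it verifies matching sizes, alignments and member offsets. That is a weaker property than what the standard calls "layout-compatible", so passing these checks does not by itself make a reinterpret_cast between the types well-defined.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Stand-ins for the types from the question ("more data" members omitted).
typedef std::uint32_t LegacyFlags;
struct LegacyFoo { std::uint32_t x; std::uint32_t y; LegacyFlags flags; };
struct LegacyBar { LegacyFoo foo; float a; };

enum class ModernFlags : std::uint32_t { first_flag = 0, second_flag = 1 };
struct ModernFoo { std::uint32_t x; std::uint32_t y; ModernFlags flags; };
struct ModernBar { ModernFoo foo; float a; };

// Standard-layout is needed for offsetof to be unconditionally supported;
// trivially copyable is needed for std::memcpy and std::bit_cast.
static_assert(std::is_standard_layout_v<LegacyBar> && std::is_standard_layout_v<ModernBar>);
static_assert(std::is_trivially_copyable_v<LegacyBar> && std::is_trivially_copyable_v<ModernBar>);

// Same size, alignment and member offsets on both sides.
static_assert(sizeof(ModernBar) == sizeof(LegacyBar));
static_assert(alignof(ModernBar) == alignof(LegacyBar));
static_assert(offsetof(ModernFoo, flags) == offsetof(LegacyFoo, flags));
static_assert(offsetof(ModernBar, a) == offsetof(LegacyBar, a));

// The strong enum must be backed by the same integer type as the legacy flags.
static_assert(std::is_same_v<std::underlying_type_t<ModernFlags>, LegacyFlags>);
```

If a future change to either library breaks one of these assumptions, the build fails instead of silently producing mismatched data.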

Now the question is: How can libModern convert between the legacy and the modern types without much overhead? I know of 3 options:

  1. reinterpret_cast. This is undefined behavior, but in practice produces perfect assembly. I want to avoid this, since I cannot rely on this still working tomorrow or on another compiler.
  2. std::memcpy. In simple cases this generates the same optimal assembly, but in any non-trivial case this adds significant overhead.
  3. C++20's std::bit_cast. In my tests, at best it produces exactly the same code as memcpy. In some cases it's worse.

This is a comparison of the 3 ways to interface with libLegacy:

  1. Interfacing with legacy_input()
    1. Using reinterpret_cast:
      void input_ub(ModernBar const& s) noexcept {
          legacy_input(reinterpret_cast<LegacyBar const*>(&s));
      }
      
      Assembly:
      input_ub(ModernBar const&):
              jmp     legacy_input
      
      This is perfect codegen, but it invokes UB.
    2. Using memcpy:
      void input_memcpy(ModernBar const& s) noexcept {
          LegacyBar ls;
          std::memcpy(&ls, &s, sizeof(ls));
          legacy_input(&ls);
      }
      
      Assembly:
      input_memcpy(ModernBar const&):
              sub     rsp, 24
              movdqu  xmm0, XMMWORD PTR [rdi]
              mov     rdi, rsp
              movaps  XMMWORD PTR [rsp], xmm0
              call    legacy_input
              add     rsp, 24
              ret
      
      Significantly worse.
    3. Using bit_cast:
      void input_bit_cast(ModernBar const& s) noexcept {
          LegacyBar ls = std::bit_cast<LegacyBar>(s);
          legacy_input(&ls);
      }
      
      Assembly:
      input_bit_cast(ModernBar const&):
              sub     rsp, 40
              movdqu  xmm0, XMMWORD PTR [rdi]
              mov     rdi, rsp
              movaps  XMMWORD PTR [rsp+16], xmm0
              mov     rax, QWORD PTR [rsp+16]
              mov     QWORD PTR [rsp], rax
              mov     rax, QWORD PTR [rsp+24]
              mov     QWORD PTR [rsp+8], rax
              call    legacy_input
              add     rsp, 40
              ret
      
      And I have no idea what's going on here.
  2. Interfacing with legacy_output()
    1. Using reinterpret_cast:
      auto output_ub() noexcept -> ModernBar {
          ModernBar s;
          legacy_output(reinterpret_cast<LegacyBar*>(&s));
          return s;
      }
      
      Assembly:
      output_ub():
              sub     rsp, 56
              lea     rdi, [rsp+16]
              call    legacy_output
              mov     rax, QWORD PTR [rsp+16]
              mov     rdx, QWORD PTR [rsp+24]
              add     rsp, 56
              ret
      
    2. Using memcpy:
      auto output_memcpy() noexcept -> ModernBar {
          LegacyBar ls;
          legacy_output(&ls);
          ModernBar s;
          std::memcpy(&s, &ls, sizeof(ls));
          return s;
      }
      
      Assembly:
      output_memcpy():
              sub     rsp, 56
              lea     rdi, [rsp+16]
              call    legacy_output
              mov     rax, QWORD PTR [rsp+16]
              mov     rdx, QWORD PTR [rsp+24]
              add     rsp, 56
              ret
      
    3. Using bit_cast:
      auto output_bit_cast() noexcept -> ModernBar {
          LegacyBar ls;
          legacy_output(&ls);
          return std::bit_cast<ModernBar>(ls);
      }
      
      Assembly:
      output_bit_cast():
              sub     rsp, 72
              lea     rdi, [rsp+16]
              call    legacy_output
              movdqa  xmm0, XMMWORD PTR [rsp+16]
              movaps  XMMWORD PTR [rsp+48], xmm0
              mov     rax, QWORD PTR [rsp+48]
              mov     QWORD PTR [rsp+32], rax
              mov     rax, QWORD PTR [rsp+56]
              mov     QWORD PTR [rsp+40], rax
              mov     rax, QWORD PTR [rsp+32]
              mov     rdx, QWORD PTR [rsp+40]
              add     rsp, 72
              ret
      

Here you can find the entire example on Compiler Explorer.

I also noticed that the codegen varies significantly depending on the exact definition of the structs (i.e. the order, number and types of the members). But the UB version of the code is consistently better than, or at least as good as, the other two versions.

Now my questions are:

  1. How come the codegen varies so dramatically? It makes me wonder if I'm missing something important.
  2. Is there something I can do to guide the compiler to generate better code without invoking UB?
  3. Are there other standard-conformant ways that generate better code?
Drag-On asked Jun 18 '26 10:06


1 Answer

In your compiler explorer link, Clang produces the same code for all output cases. I don't know what problem GCC has with std::bit_cast in that situation.

For the input case, the three functions cannot produce the same code, because they have different semantics.

With input_ub, legacy_input receives the address of the caller's object, so the call may modify that object. This cannot happen in the other two versions, where legacy_input only ever sees a local copy. Since the compiler does not know how legacy_input behaves, it cannot optimize the copies away.

If the input functions take their argument by value, then all three versions produce the same code, at least with Clang in your Compiler Explorer link.
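A by-value variant of the memcpy version might look like the sketch below. The type definitions repeat those from the question without the elided members, and the legacy_input defined here is a hypothetical stand-in that only records its argument so the example can run on its own. In the real setting it is an external function, which is what makes the codegen comparison meaningful.

```cpp
#include <cstdint>
#include <cstring>

// Stand-ins for the types from the question ("more data" members omitted).
typedef std::uint32_t LegacyFlags;
struct LegacyFoo { std::uint32_t x; std::uint32_t y; LegacyFlags flags; };
struct LegacyBar { LegacyFoo foo; float a; };

enum class ModernFlags : std::uint32_t { first_flag = 0, second_flag = 1 };
struct ModernFoo { std::uint32_t x; std::uint32_t y; ModernFlags flags; };
struct ModernBar { ModernFoo foo; float a; };

static_assert(sizeof(ModernBar) == sizeof(LegacyBar));

// Hypothetical stand-in for libLegacy: remembers the last value it was given.
static LegacyBar last_seen{};
void legacy_input(LegacyBar const* s) { last_seen = *s; }

// The parameter is already a private copy of the caller's object, so copying
// its bytes into a LegacyBar no longer changes what legacy_input can observe.
void input_memcpy_by_value(ModernBar s) noexcept {
    LegacyBar ls;
    std::memcpy(&ls, &s, sizeof(ls));
    legacy_input(&ls);
}
```

The std::bit_cast version changes in the same way: only the parameter type goes from ModernBar const& to ModernBar.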

To reproduce the code generated by the original input_ub you need to keep passing the address of the caller's object to legacy_input.

If legacy_input is an extern "C" function, then I don't think the standards specify how the object models of the two languages are supposed to interact in this call. So, for the purpose of the language-lawyer tag, I will assume that legacy_input is an ordinary C++ function.

The problem with passing &s directly is that there is generally no LegacyBar object at the same address that is pointer-interconvertible with the ModernBar object. So if legacy_input tries to access LegacyBar members through the pointer, that would be UB.

Theoretically you could create a LegacyBar object at the required address, reusing the object representation of the ModernBar object. However, since the caller presumably will expect there to still be a ModernBar object after the call, you then need to recreate a ModernBar object in the storage by the same procedure.

Unfortunately though, you are not always allowed to reuse storage in this way. For example, if the passed reference refers to a const complete object, that would be UB, and there are other requirements. There is also the question of whether the caller's references to the old object will refer to the new object, that is, whether the two ModernBar objects are transparently replaceable. This would also not always be the case.
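The storage-reuse idea can be sketched roughly as follows. The helper reuse_storage_as is a hypothetical name. It relies on std::memmove implicitly creating objects in its destination (a C++20 rule) and on both types being trivially copyable aggregates, and it is modelled on the formulation commonly given for C++23's std::start_lifetime_as. The legacy_input here is again a recording stand-in, and the "more data" members are omitted. Whether this is well-defined for a particular call still depends on the constraints described above, so treat it as an illustration rather than a vetted solution.

```cpp
#include <cstdint>
#include <cstring>
#include <new>
#include <type_traits>

// Stand-ins for the types from the question ("more data" members omitted).
typedef std::uint32_t LegacyFlags;
struct LegacyFoo { std::uint32_t x; std::uint32_t y; LegacyFlags flags; };
struct LegacyBar { LegacyFoo foo; float a; };

enum class ModernFlags : std::uint32_t { first_flag = 0, second_flag = 1 };
struct ModernFoo { std::uint32_t x; std::uint32_t y; ModernFlags flags; };
struct ModernBar { ModernFoo foo; float a; };

// Hypothetical helper: begin the lifetime of a To in the storage of *from,
// keeping the bytes. Taking a pointer to non-const rejects const arguments
// at compile time.
template <class To, class From>
To* reuse_storage_as(From* from) noexcept {
    static_assert(sizeof(To) == sizeof(From) && alignof(To) == alignof(From));
    static_assert(std::is_trivially_copyable_v<To> && std::is_trivially_copyable_v<From>);
    void* p = from;
    // memmove onto itself leaves the bytes unchanged but may implicitly create
    // a To object in that storage; launder yields a pointer to it.
    return std::launder(static_cast<To*>(std::memmove(p, p, sizeof(To))));
}

// Hypothetical stand-in for libLegacy: remembers the last value it was given.
static LegacyBar last_seen{};
void legacy_input(LegacyBar const* s) { last_seen = *s; }

// Passes the caller's storage to legacy_input, then recreates a ModernBar there.
// The caller should continue through the returned pointer, not through older
// pointers or references to the original object.
ModernBar* input_reuse(ModernBar* s) noexcept {
    LegacyBar* ls = reuse_storage_as<LegacyBar>(s);
    legacy_input(ls);
    return reuse_storage_as<ModernBar>(ls);
}
```

Returning the new pointer sidesteps, rather than answers, the question of transparent replaceability: the sketch does not claim that the caller's original reference remains usable.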

So in general I don't think you can achieve the same code generation without undefined behavior if you don't put additional constraints on the references passed to the function.

user17732522 answered Jun 20 '26 00:06



