Say I'm working on a library called libModern. This library uses a legacy C library, called libLegacy, as an implementation strategy. libLegacy's interface looks like this:
typedef uint32_t LegacyFlags;
struct LegacyFoo {
uint32_t x;
uint32_t y;
LegacyFlags flags;
// more data
};
struct LegacyBar {
LegacyFoo foo;
float a;
// more data
};
void legacy_input(LegacyBar const* s); // Does something with s
void legacy_output(LegacyBar* s); // Stores data in s
libModern shouldn't expose libLegacy's types to its users for various reasons, among them:
The textbook way to deal with this situation is the pimpl idiom: libModern would provide a wrapper type that internally has a pointer to the legacy data. However, this is not possible here, since libModern cannot allocate dynamic memory. Generally, its goal is not to add a lot of overhead.
Therefore, libModern defines its own types that are layout-compatible with the legacy types, yet have a better interface. In this example it is using a strong enum instead of a plain uint32_t for flags:
enum class ModernFlags : std::uint32_t
{
first_flag = 0,
second_flag = 1,
};
struct ModernFoo {
std::uint32_t x;
std::uint32_t y;
ModernFlags flags;
// More data
};
struct ModernBar {
ModernFoo foo;
float a;
// more data
};
Now the question is: How can libModern convert between the legacy and the modern types without much overhead? I know of 3 options:
reinterpret_cast. This is undefined behavior, but in practice produces perfect assembly. I want to avoid this, since I cannot rely on this still working tomorrow or on another compiler.std::memcpy. In simple cases this generates the same optimal assembly, but in any non-trivial case this adds significant overhead.std::bit_cast. In my tests, at best it produces exactly the same code as memcpy. In some cases it's worse.This is a comparison of the 3 ways to interface with libLegacy:
legacy_input()
reinterpret_cast:
void input_ub(ModernBar const& s) noexcept {
legacy_input(reinterpret_cast<LegacyBar const*>(&s));
}
Assembly:
input_ub(ModernBar const&):
jmp legacy_input
This is perfect codegen, but it invokes UB.memcpy:
void input_memcpy(ModernBar const& s) noexcept {
LegacyBar ls;
std::memcpy(&ls, &s, sizeof(ls));
legacy_input(&ls);
}
Assembly:
input_memcpy(ModernBar const&):
sub rsp, 24
movdqu xmm0, XMMWORD PTR [rdi]
mov rdi, rsp
movaps XMMWORD PTR [rsp], xmm0
call legacy_input
add rsp, 24
ret
Significantly worse.bit_cast:
void input_bit_cast(ModernBar const& s) noexcept {
LegacyBar ls = std::bit_cast<LegacyBar>(s);
legacy_input(&ls);
}
Assembly:
input_bit_cast(ModernBar const&):
sub rsp, 40
movdqu xmm0, XMMWORD PTR [rdi]
mov rdi, rsp
movaps XMMWORD PTR [rsp+16], xmm0
mov rax, QWORD PTR [rsp+16]
mov QWORD PTR [rsp], rax
mov rax, QWORD PTR [rsp+24]
mov QWORD PTR [rsp+8], rax
call legacy_input
add rsp, 40
ret
And I have no idea what's going on here.reinterpret_cast:
auto output_ub() noexcept -> ModernBar {
ModernBar s;
legacy_output(reinterpret_cast<LegacyBar*>(&s));
return s;
}
Assembly:
output_ub():
sub rsp, 56
lea rdi, [rsp+16]
call legacy_output
mov rax, QWORD PTR [rsp+16]
mov rdx, QWORD PTR [rsp+24]
add rsp, 56
ret
memcpy:
auto output_memcpy() noexcept -> ModernBar {
LegacyBar ls;
legacy_output(&ls);
ModernBar s;
std::memcpy(&s, &ls, sizeof(ls));
return s;
}
Assembly:
output_memcpy():
sub rsp, 56
lea rdi, [rsp+16]
call legacy_output
mov rax, QWORD PTR [rsp+16]
mov rdx, QWORD PTR [rsp+24]
add rsp, 56
ret
bit_cast:
auto output_bit_cast() noexcept -> ModernBar {
LegacyBar ls;
legacy_output(&ls);
return std::bit_cast<ModernBar>(ls);
}
Assembly:
output_bit_cast():
sub rsp, 72
lea rdi, [rsp+16]
call legacy_output
movdqa xmm0, XMMWORD PTR [rsp+16]
movaps XMMWORD PTR [rsp+48], xmm0
mov rax, QWORD PTR [rsp+48]
mov QWORD PTR [rsp+32], rax
mov rax, QWORD PTR [rsp+56]
mov QWORD PTR [rsp+40], rax
mov rax, QWORD PTR [rsp+32]
mov rdx, QWORD PTR [rsp+40]
add rsp, 72
ret
Here you can find the entire example on Compiler Explorer.
I also noted that the codegen varies significantly depending on the exact definition of the structs (i.e. order, amount & type of members). But the UB version of the code is consistently better or at least as good as the other two versions.
Now my questions are:
In your compiler explorer link, Clang produces the same code for all output cases. I don't know what problem GCC has with std::bit_cast in that situation.
For the input case, the three functions cannot produce the same code, because they have different semantics.
With input_ub, the call to legacy_input may be modifying the caller's object. This cannot be the case in the other two versions. Therefore the compiler cannot optimize away the copies, not knowing how legacy_input behaves.
If you pass by-value to the input functions, then all three versions produce the same code at least with Clang in your compiler explorer link.
To reproduce the code generated by the original input_ub you need to keep passing the address of the caller's object to legacy_input.
If legacy_input is an extern C function, then I don't think the standards specify how the object models of the two languages are supposed to interact in this call. So, for the purpose of the language-lawyer tag, I will assume that legacy_input is an ordinary C++ function.
The problem in passing the address of &s directly is that there is generally no LegacyBar object at the same address that is pointer-interconvertible with the ModernBar object. So if legacy_input tries to access LegacyBar members through the pointer, that would be UB.
Theoretically you could create a LegacyBar object at the required address, reusing the object representation of the ModernBar object. However, since the caller presumably will expect there to still be a ModernBar object after the call, you then need to recreate a ModernBar object in the storage by the same procedure.
Unfortunately though, you are not always allowed to reuse storage in this way. For example if the passed reference refers to a const complete object, that would be UB, and there are other requirements. The problem is also whether the caller's references to the old object will refer to the new object, meaning whether the two ModernBar objects are transparently replaceable. This would also not always be the case.
So in general I don't think you can achieve the same code generation without undefined behavior if you don't put additional constraints on the references passed to the function.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With