Convert uint64_t to byte array portably and optimally in Clang

I want to convert a uint64_t to a uint8_t[8] (little endian). On a little-endian architecture you can just do an ugly reinterpret_cast<> or memcpy(), e.g.:

void from_memcpy(const std::uint64_t &x, uint8_t* bytes) {
    std::memcpy(bytes, &x, sizeof(x));
}

This generates efficient assembly:

mov     rax, qword ptr [rdi]
mov     qword ptr [rsi], rax
ret

However it is not portable: it will behave differently on a big-endian machine.
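
One obvious portable workaround is to pick the fast path on the native byte order at compile time (the sketch below assumes C++20's std::endian is available; the name from_dispatch and exact shape are just illustrative), but that still means writing the slow path by hand rather than having a single construct the compiler optimises:

#include <bit>
#include <cstdint>
#include <cstring>

void from_dispatch(const std::uint64_t &x, std::uint8_t* bytes) {
    if constexpr (std::endian::native == std::endian::little) {
        // Native layout already matches the desired little-endian output.
        std::memcpy(bytes, &x, sizeof(x));
    } else {
        // Fall back to explicit little-endian byte extraction.
        for (int i = 0; i < 8; ++i)
            bytes[i] = static_cast<std::uint8_t>(x >> 8 * i);
    }
}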

For converting uint8_t[8] to uint64_t there is a great solution: just do this:

void to(const std::uint8_t* bytes, std::uint64_t &x) {
    x = (std::uint64_t(bytes[0]) << 8*0) |
        (std::uint64_t(bytes[1]) << 8*1) |
        (std::uint64_t(bytes[2]) << 8*2) |
        (std::uint64_t(bytes[3]) << 8*3) |
        (std::uint64_t(bytes[4]) << 8*4) |
        (std::uint64_t(bytes[5]) << 8*5) |
        (std::uint64_t(bytes[6]) << 8*6) |
        (std::uint64_t(bytes[7]) << 8*7);
}

This looks inefficient, but with Clang -O2 it actually generates exactly the same assembly as before. Better still, the compiler is smart enough to use a native byte-swap instruction whenever the byte order in the source doesn't match the target. E.g. this big-endian version:

void to(const std::uint8_t* bytes, std::uint64_t &x) {
    x = (std::uint64_t(bytes[7]) << 8*0) |
        (std::uint64_t(bytes[6]) << 8*1) |
        (std::uint64_t(bytes[5]) << 8*2) |
        (std::uint64_t(bytes[4]) << 8*3) |
        (std::uint64_t(bytes[3]) << 8*4) |
        (std::uint64_t(bytes[2]) << 8*5) |
        (std::uint64_t(bytes[1]) << 8*6) |
        (std::uint64_t(bytes[0]) << 8*7);
}

Compiles (on x86-64) to:

mov     rax, qword ptr [rdi]
bswap   rax
mov     qword ptr [rsi], rax
ret

My question is: is there an equivalent reliably-optimised construct for converting in the opposite direction? I've tried this, but it gets compiled naively:

void from(const std::uint64_t &x, uint8_t* bytes) {
    bytes[0] = x >> 8*0;
    bytes[1] = x >> 8*1;
    bytes[2] = x >> 8*2;
    bytes[3] = x >> 8*3;
    bytes[4] = x >> 8*4;
    bytes[5] = x >> 8*5;
    bytes[6] = x >> 8*6;
    bytes[7] = x >> 8*7;
}

Edit: After some experimentation, this code does get compiled optimally with GCC 8.1 and later as long as you use uint8_t* __restrict__ bytes. However I still haven't managed to find a form that Clang will optimise.
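
For reference, the __restrict__ variant mentioned above presumably looks like the following (my reconstruction of the form described, not code that appeared in the post):

#include <cstdint>

void from_restrict(const std::uint64_t &x, std::uint8_t* __restrict__ bytes) {
    // Same per-byte stores as from(), but promising the compiler that
    // `bytes` cannot alias `x`; GCC 8.1+ reportedly folds this into a
    // single 8-byte store (plus a byte swap on big-endian targets).
    bytes[0] = x >> 8*0;
    bytes[1] = x >> 8*1;
    bytes[2] = x >> 8*2;
    bytes[3] = x >> 8*3;
    bytes[4] = x >> 8*4;
    bytes[5] = x >> 8*5;
    bytes[6] = x >> 8*6;
    bytes[7] = x >> 8*7;
}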

Asked by Timmmm, May 07 '19

1 Answer

Here's what I could come up with, based on the discussion in the OP's comments:

void from_optimized(const std::uint64_t &x, std::uint8_t* bytes) {
    // Build the value in a local uint64_t so the compiler can see that
    // all eight bytes end up in one contiguous 8-byte object.
    std::uint64_t big;
    std::uint8_t* temp = (std::uint8_t*)&big;
    temp[0] = x >> 8*0;
    temp[1] = x >> 8*1;
    temp[2] = x >> 8*2;
    temp[3] = x >> 8*3;
    temp[4] = x >> 8*4;
    temp[5] = x >> 8*5;
    temp[6] = x >> 8*6;
    temp[7] = x >> 8*7;
    // Then write all eight bytes out with a single 8-byte store.
    std::uint64_t* dest = (std::uint64_t*)bytes;
    *dest = big;
}

Staging the result in a local uint64_t apparently makes the intent clear enough for the compiler to collapse the byte writes into a single 8-byte store, both on GCC and Clang with -O2.

Compiling for x86-64 (little endian) with Clang 8.0.0 (tested on Godbolt):

mov     rax, qword ptr [rdi]
mov     qword ptr [rsi], rax
ret

Compiling for aarch64_be (big endian) with Clang 8.0.0 (tested on Godbolt):

ldr     x8, [x0]
rev     x8, x8
str     x8, [x1]
ret
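
One caveat (my note, not part of the original answer): the final store through a std::uint64_t* technically violates strict aliasing, since the buffer behind bytes need not actually contain a uint64_t object. Below is a sketch of the same idea using std::memcpy for the final store, which compilers are generally able to lower to the same single load/store pair (worth re-checking on Godbolt; the name from_optimized_memcpy is illustrative):

#include <cstdint>
#include <cstring>

void from_optimized_memcpy(const std::uint64_t &x, std::uint8_t* bytes) {
    std::uint64_t big;
    // uint8_t is in practice unsigned char, so these byte-wise writes into
    // the local uint64_t are fine aliasing-wise.
    std::uint8_t* temp = reinterpret_cast<std::uint8_t*>(&big);
    temp[0] = x >> 8*0;
    temp[1] = x >> 8*1;
    temp[2] = x >> 8*2;
    temp[3] = x >> 8*3;
    temp[4] = x >> 8*4;
    temp[5] = x >> 8*5;
    temp[6] = x >> 8*6;
    temp[7] = x >> 8*7;
    // memcpy sidesteps the aliasing question and still compiles to one store.
    std::memcpy(bytes, &big, sizeof(big));
}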

Answered by Toribio, Sep 17 '22