
How to make GCC generate bswap instruction for big endian store without builtins?

Tags:

intel

Update: This was fixed in GCC 8.1.

I'm working on a function that stores a 64-bit value into memory in big endian format. I was hoping that I could write portable C99 code that works on both little and big endian platforms and have modern x86 compilers generate a bswap instruction automatically without any builtins or intrinsics. So I started with the following function:

#include <stdint.h>

void
encode_bigend_u64(uint64_t value, void *vdest) {
    uint8_t *bytes = (uint8_t *)vdest;
    bytes[0] = value >> 56;
    bytes[1] = value >> 48;
    bytes[2] = value >> 40;
    bytes[3] = value >> 32;
    bytes[4] = value >> 24;
    bytes[5] = value >> 16;
    bytes[6] = value >> 8;
    bytes[7] = value;
}

This works fine with clang, which compiles this function to:

bswapq  %rdi
movq    %rdi, (%rsi)
retq

But GCC fails to detect the byte swap. I tried a couple of different approaches but they only made things worse. I know that GCC can detect byte swaps using bitwise-and, shift, and bitwise-or, but why doesn't it work when writing bytes?
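For completeness, here is the matching byte-wise load written the same way, next to the store from above, so the intended contract (most-significant byte first, on any host) is easy to check; `decode_bigend_u64` is my name for it, not something from the question:

```c
#include <stdint.h>

/* byte-wise big-endian store, as above: highest byte goes to bytes[0] */
void encode_bigend_u64(uint64_t value, void *vdest) {
    uint8_t *bytes = (uint8_t *)vdest;
    bytes[0] = value >> 56;
    bytes[1] = value >> 48;
    bytes[2] = value >> 40;
    bytes[3] = value >> 32;
    bytes[4] = value >> 24;
    bytes[5] = value >> 16;
    bytes[6] = value >> 8;
    bytes[7] = value;
}

/* the matching byte-wise load (my addition): reads MSB first */
uint64_t decode_bigend_u64(const void *vsrc) {
    const uint8_t *bytes = (const uint8_t *)vsrc;
    return ((uint64_t)bytes[0] << 56) | ((uint64_t)bytes[1] << 48) |
           ((uint64_t)bytes[2] << 40) | ((uint64_t)bytes[3] << 32) |
           ((uint64_t)bytes[4] << 24) | ((uint64_t)bytes[5] << 16) |
           ((uint64_t)bytes[6] << 8)  |  (uint64_t)bytes[7];
}
```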

Edit: I found the corresponding GCC bug.

asked Apr 08 '16 by nwellnhof


3 Answers

This seems to do the trick:

#include <stdint.h>
#include <string.h>   // for memcpy

void encode_bigend_u64(uint64_t value, void* dest) {
  value = ((value & 0xFF00000000000000u) >> 56u) |
          ((value & 0x00FF000000000000u) >> 40u) |
          ((value & 0x0000FF0000000000u) >> 24u) |
          ((value & 0x000000FF00000000u) >>  8u) |
          ((value & 0x00000000FF000000u) <<  8u) |
          ((value & 0x0000000000FF0000u) << 24u) |
          ((value & 0x000000000000FF00u) << 40u) |
          ((value & 0x00000000000000FFu) << 56u);
  memcpy(dest, &value, sizeof(uint64_t));
}

clang with -O3

encode_bigend_u64(unsigned long, void*):
        bswapq  %rdi
        movq    %rdi, (%rsi)
        retq

clang with -O3 -march=native

encode_bigend_u64(unsigned long, void*):
        movbeq  %rdi, (%rsi)
        retq

gcc with -O3

encode_bigend_u64(unsigned long, void*):
        bswap   %rdi
        movq    %rdi, (%rsi)
        ret

gcc with -O3 -march=native

encode_bigend_u64(unsigned long, void*):
        movbe   %rdi, (%rsi)
        ret

Tested with clang 3.8.0 and gcc 5.3.0 on http://gcc.godbolt.org/, so I don't know exactly which processor is underneath for the -march=native builds, but I strongly suspect a recent x86_64 processor.


If you want a function which works for big endian architectures too, you can use the answers from here to detect the endianness of the system and add an if. Both the union and the pointer-cast versions work, and both gcc and clang optimize them to the exact same assembly (no branches). Full code on godbolt:

int is_big_endian(void) {
    union {
        uint32_t i;
        char c[4];
    } bint = {0x01020304};

    return bint.c[0] == 1;
}

void encode_bigend_u64_union(uint64_t value, void* dest) {
  if (!is_big_endian())
    //...
  memcpy(dest, &value, sizeof(uint64_t));
}
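One way the elided branch might be filled in (a sketch, not the exact godbolt code; `bswap64_portable` is a name I made up for the mask-and-shift reversal from the first snippet):

```c
#include <stdint.h>
#include <string.h>

static int is_big_endian(void) {
    union { uint32_t i; char c[4]; } bint = { 0x01020304 };
    return bint.c[0] == 1;
}

/* hypothetical helper: portable 64-bit byte reversal via mask-and-shift */
static uint64_t bswap64_portable(uint64_t v) {
    return ((v & 0xFF00000000000000u) >> 56) | ((v & 0x00FF000000000000u) >> 40) |
           ((v & 0x0000FF0000000000u) >> 24) | ((v & 0x000000FF00000000u) >>  8) |
           ((v & 0x00000000FF000000u) <<  8) | ((v & 0x0000000000FF0000u) << 24) |
           ((v & 0x000000000000FF00u) << 40) | ((v & 0x00000000000000FFu) << 56);
}

void encode_bigend_u64_union(uint64_t value, void *dest) {
    if (!is_big_endian())            /* on little-endian hosts, reverse first */
        value = bswap64_portable(value);
    memcpy(dest, &value, sizeof(uint64_t));
}
```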

Intel® 64 and IA-32 Architectures Instruction Set Reference (3-542 Vol. 2A):

MOVBE—Move Data After Swapping Bytes

Performs a byte swap operation on the data copied from the second operand (source operand) and store the result in the first operand (destination operand). [...]

The MOVBE instruction is provided for swapping the bytes on a read from memory or on a write to memory; thus providing support for converting little-endian values to big-endian format and vice versa.

answered Oct 10 '22 by bolov


All functions in this answer with asm output on the Godbolt Compiler Explorer


GNU C has a uint64_t __builtin_bswap64 (uint64_t x), since GNU C 4.3. This is apparently the most reliable way to get gcc / clang to generate code that doesn't suck for this.

glibc provides htobe64, htole64, and similar host to/from BE and LE functions that swap or not, depending on the endianness of the machine. See the docs for <endian.h>. The man page says they were added to glibc in version 2.9 (released 2008-11).

#define _BSD_SOURCE             /* See feature_test_macros(7) */

#include <stdint.h>

#include <endian.h>
// ideal code with clang from 3.0 onwards, probably earlier
// ideal code with gcc from 4.4.7 onwards, probably earlier
uint64_t load_be64_endian_h(const uint64_t *be_src) { return be64toh(*be_src); }
    movq    (%rdi), %rax
    bswap   %rax

void store_be64_endian_h(uint64_t *be_dst, uint64_t data) { *be_dst = htobe64(data); }
    bswap   %rsi
    movq    %rsi, (%rdi)

// check that the compiler understands the data movement and optimizes away a double-conversion (which inline-asm `bswap` wouldn't)
// it does optimize away with gcc 4.9.3 and later, but not with gcc 4.9.0 (2x bswap)
// optimizes away with clang 3.7.0 and later, but not clang 3.6 or earlier (2x bswap)
uint64_t double_convert(uint64_t data) {
  uint64_t tmp;
  store_be64_endian_h(&tmp, data);
  return load_be64_endian_h(&tmp);
}
    movq    %rdi, %rax

You safely get good code even at -O1 from those functions, and they use movbe when -march is set to a CPU that supports that insn.


If you're targeting GNU C, but not glibc, you can borrow the definition from glibc (remember it's LGPLed code, though):

#ifdef __GNUC__
# if __GNUC_PREREQ (4, 3)

static __inline unsigned int
__bswap_32 (unsigned int __bsx) { return __builtin_bswap32 (__bsx);  }

# elif __GNUC__ >= 2
    // ... some fallback stuff you only need if you're using an ancient gcc version, using inline asm for non-compile-time-constant args
# endif  // gcc version
#endif // __GNUC__

If you really need a fallback that might compile well on compilers that don't support GNU C builtins, the code from @bolov's answer could be used to implement a bswap that compiles nicely. Pre-processor macros could be used to choose whether to swap or not (like glibc does), to implement host-to-BE and host-to-LE functions. The bswap used by glibc when __builtin_bswap or x86 asm isn't available uses the mask-and-shift idiom that bolov found was good. gcc recognizes it better than just shifting.
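A sketch of what such a macro-selected header could look like (the structure is modeled on glibc's approach, but the `MY_*` names are mine, not glibc's):

```c
#include <stdint.h>

/* use the builtin when the compiler is GNU C >= 4.3, else fall back
   to the mask-and-shift idiom that gcc's pattern-recognizer handles well */
#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))
#  define MY_BSWAP64(x) __builtin_bswap64(x)
#else
static inline uint64_t my_bswap64_fallback(uint64_t v) {
    return ((v & 0xFF00000000000000u) >> 56) | ((v & 0x00FF000000000000u) >> 40) |
           ((v & 0x0000FF0000000000u) >> 24) | ((v & 0x000000FF00000000u) >>  8) |
           ((v & 0x00000000FF000000u) <<  8) | ((v & 0x0000000000FF0000u) << 24) |
           ((v & 0x000000000000FF00u) << 40) | ((v & 0x00000000000000FFu) << 56);
}
#  define MY_BSWAP64(x) my_bswap64_fallback(x)
#endif

/* host-to-big-endian: swap only on little-endian hosts */
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#  define MY_HTOBE64(x) ((uint64_t)(x))
#else
#  define MY_HTOBE64(x) MY_BSWAP64(x)
#endif
```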


The code from this Endian-agnostic coding blog post compiles to bswap with gcc, but not with clang. IDK if there's anything that both their pattern-recognizers will recognize.

// Note that this is a load, not a store like the code in the question.
uint64_t be64_to_host(unsigned char* data) {
    return
      ((uint64_t)data[7]<<0)  | ((uint64_t)data[6]<<8 ) |
      ((uint64_t)data[5]<<16) | ((uint64_t)data[4]<<24) |
      ((uint64_t)data[3]<<32) | ((uint64_t)data[2]<<40) |
      ((uint64_t)data[1]<<48) | ((uint64_t)data[0]<<56);
}

    ## gcc 5.3 -O3 -march=haswell
    movbe   (%rdi), %rax
    ret

    ## clang 3.8 -O3 -march=haswell
    movzbl  7(%rdi), %eax
    movzbl  6(%rdi), %ecx
    shlq    $8, %rcx
    orq     %rax, %rcx
    ... completely naive implementation

The htonll from this answer compiles to two 32bit bswaps combined with shift/or. This kind of sucks, but isn't terrible with either gcc or clang.
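Written out with builtins, the two-32-bit-swap pattern looks roughly like this (the function name is mine; it assumes GNU C for `__builtin_bswap32`):

```c
#include <stdint.h>

/* full 64-bit byte reversal built from two 32-bit bswaps:
   swap each half individually, then exchange the halves */
static inline uint64_t bswap64_2x32(uint64_t v) {
    return ((uint64_t)__builtin_bswap32((uint32_t)v) << 32) |
            (uint64_t)__builtin_bswap32((uint32_t)(v >> 32));
}
```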


I didn't have any luck with a union { uint64_t a; uint8_t b[8]; } version of the OP's code. clang still compiles it to a 64-bit bswap, but I think it compiles to even worse code with gcc. (See the godbolt link.)

answered Oct 10 '22 by Peter Cordes


I like Peter's solution, but here's something else you can use on Haswell. Haswell has the movbe instruction, which is 3 uops there (no cheaper than bswap r64 + a normal load or store), but is faster on Atom / Silvermont (https://agner.org/optimize/):

// AT&T syntax, compile without -masm=intel
inline
uint64_t load_bigend_u64(uint64_t value)
{
    __asm__ ("movbe %[src], %[dst]"   // x86-64 only
             :  [dst] "=r" (value)
             :  [src] "m" (value)
            );
    return value;
}

Use it with something like uint64_t tmp = load_bigend_u64(array[i]);

You could reverse this to make a store_bigend function, or use bswap to modify a value in a register and let the compiler load/store it.
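For the store direction, here is a sketch that sidesteps the inline asm entirely and lets the compiler pick bswap or movbe itself (the function name is mine; it assumes GNU C for `__builtin_bswap64` and the `__BYTE_ORDER__` macro):

```c
#include <stdint.h>
#include <string.h>

/* hypothetical store_bigend: reverse on little-endian hosts, then do an
   alignment-safe store via memcpy; with -march=haswell the compiler is
   free to fold the swap and the store into a single movbe */
static inline void store_bigend_u64(void *dest, uint64_t value) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    value = __builtin_bswap64(value);
#endif
    memcpy(dest, &value, sizeof value);
}
```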


I changed the function to return the value because the alignment of vdest was not clear to me.

Usually a feature is guarded by a preprocessor macro. I'd expect __MOVBE__ to be used for the movbe feature flag, but it's not present (this machine has the feature):

$ gcc -march=native -dM -E - < /dev/null | sort
...
#define __LWP__ 1
#define __LZCNT__ 1
#define __MMX__ 1
#define __MWAITX__ 1
#define __NO_INLINE__ 1
#define __ORDER_BIG_ENDIAN__ 4321
...
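Since there is no __MOVBE__ macro to test at compile time, the feature can be checked at runtime instead; a sketch using the cpuid intrinsic header (GNU C on x86 only; `have_movbe` is my name):

```c
#include <cpuid.h>    /* GNU C, x86 only: __get_cpuid and bit_MOVBE */

/* runtime check for MOVBE: CPUID leaf 1, ECX bit 22
   (bit_MOVBE is defined by <cpuid.h>) */
static int have_movbe(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                       /* cpuid leaf not supported */
    return (ecx & bit_MOVBE) != 0;
}
```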
answered Oct 10 '22 by jww