What is the fastest way to swap the bytes of an unaligned 64 bit value in memory?

I have a large number of 64 bit values in memory. Unfortunately they might not be aligned to 64 bit addresses. My goal is to change the endianness of all those values, i.e. to swap/reverse their bytes.

I know about the bswap instruction which swaps the bytes of a 32 or 64 bit register. But as it requires a register argument, I cannot pass it my memory address. Of course I can first load the value from memory into a register, then swap it, then write it back:

mov rax, qword [rsi]
bswap rax
mov qword [rsi], rax

But is that even correct, given that the address might be unaligned?

Another possibility is to do the swaps manually:

mov al, byte [rsi + 0]
mov bl, byte [rsi + 7]
mov byte [rsi + 0], bl
mov byte [rsi + 7], al

mov al, byte [rsi + 1]
mov bl, byte [rsi + 6]
mov byte [rsi + 1], bl
mov byte [rsi + 6], al

mov al, byte [rsi + 2]
mov bl, byte [rsi + 5]
mov byte [rsi + 2], bl
mov byte [rsi + 5], al

mov al, byte [rsi + 3]
mov bl, byte [rsi + 4]
mov byte [rsi + 3], bl
mov byte [rsi + 4], al

That's obviously a lot more instructions. But is it slower, too?

But all in all I'm still pretty inexperienced in x86-64, so I wonder: What is the fastest way to byte swap a 64 bit value in memory? Is one of the two options I described optimal? Or is there a completely different approach that is even faster?

PS: My real situation is a bit more complicated. I do have a large byte array, but it contains differently sized integers, all densely packed. Some other array tells me what size of integer to expect next. So this "description" could say "one 32 bit int, two 64 bit ints, one 16 bit int, then one 64 bit int again". I am just mentioning this here to tell you that (as far as I can tell), using SIMD instructions is not possible as I actually have to inspect the size of each integer before reading.
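
To illustrate (names and values made up), the data looks roughly like this:

#include <stdint.h>
#include <stddef.h>

// Hypothetical example of the layout: densely packed integers of mixed
// width, plus a separate array saying how wide each one is.
unsigned char buffer[] = {
    0x01, 0x02, 0x03, 0x04,                          // one 32 bit int
    0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C,  // one 64 bit int
    0x0D, 0x0E,                                      // one 16 bit int
};
uint8_t sizes[] = { 4, 8, 2 };                       // bytes per integer, in order
size_t count = sizeof(sizes) / sizeof(sizes[0]);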

asked Oct 16 '22 by Lukas Kalbertodt


1 Answer

What is the fastest way to byte swap a 64 bit value in memory?

The mov/bswap/mov version and the movbe/mov version are about the same on most Intel processors. Based on the µop count, it seems movbe decodes to mov + bswap, except on Atom. For Ryzen, movbe may be better. Manually swapping bytes around is much slower, except in certain edge cases where a large load/store is very slow, such as when it crosses a 4K boundary on pre-Skylake CPUs.
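
A hedged C sketch of that in-place swap (memcpy expresses the unaligned access portably; GCC and Clang compile __builtin_bswap64 to bswap, and may fold the load into movbe when targeting CPUs that have it):

#include <stdint.h>
#include <string.h>

// Sketch only: swap one possibly-unaligned 64 bit value in place.
static void bswap64_inplace(void* p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);     // unaligned load
    v = __builtin_bswap64(v);    // byte reverse
    memcpy(p, &v, sizeof v);     // unaligned store
}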

pshufb is a reasonable option even to replace a single bswap, though that wastes half of the work the shuffle could do.
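
For example, a sketch of that single-value variant (SSSE3, name made up); note the upper half of the register does nothing useful here:

#include <immintrin.h>

// Sketch only: byte-swap one possibly-unaligned 64 bit value with pshufb.
static void bswap64_pshufb(void* p)
{
    __m128i v = _mm_loadl_epi64((__m128i const*)p);             // load 8 bytes into the low half
    __m128i rev = _mm_setr_epi8(7, 6, 5, 4, 3, 2, 1, 0,         // reverse the low 8 bytes,
                                8, 9, 10, 11, 12, 13, 14, 15);  // leave the (unused) high half alone
    v = _mm_shuffle_epi8(v, rev);                               // pshufb
    _mm_storel_epi64((__m128i*)p, v);                           // store the low 8 bytes back
}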


PS: My real situation is a bit more complicated. I do have a large byte array, but it contains differently sized integers, all densely packed.

In this general case, with sizes dynamically taken from another data stream, a new big issue is branching on the size. Even in scalar code that can be avoided, by byte-reversing a 64-bit block, shifting it right by 8 - size bytes, then merging it with the un-reversed bytes, and advancing by size. That could be worked out, but it's a waste of time to try: the SIMD version will be better.
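
Still, a rough sketch of that scalar idea, under an assumed interface (sizes given one byte per element); it over-reads and over-writes by up to 7 bytes, so the buffer would need a little padding at the end:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

// Sketch only: branchless scalar flipper.  Reverse a whole 64 bit block,
// shift the reversed element down to the bottom, merge with the untouched
// bytes of the following elements, then advance by the element's size.
static void flipScalar(unsigned char* buf, const uint8_t* sizes, size_t count)
{
    size_t i = 0;
    for (size_t j = 0; j < count; j++) {
        unsigned size = sizes[j];                                     // 1, 2, 4 or 8
        uint64_t block;
        memcpy(&block, buf + i, 8);                                   // unaligned 64 bit load
        uint64_t rev = __builtin_bswap64(block) >> ((8 - size) * 8);  // reversed element at the bottom
        uint64_t low = ~0ull >> ((8 - size) * 8);                     // mask covering the element's bytes
        uint64_t merged = (block & ~low) | rev;                       // trailing bytes stay as they were
        memcpy(buf + i, &merged, 8);                                  // unaligned 64 bit store
        i += size;
    }
}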

A SIMD version could use pshufb and a table of shuffle masks indexed by a "size pattern", for example an 8-bit integer where every 2 bits indicates the size of an element. pshufb then reverses the elements that are wholly contained in the 16-byte window that it's looking at, and leaves the rest alone (those unchanged bytes at the tail will be written back too, but that's OK). Then we advance by the number of bytes actually processed.
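
One way that mask table could be generated (purely illustrative; it assumes the pattern packs four 2-bit codes, lowest bits first, with 0/1/2/3 meaning 1/2/4/8 bytes):

#include <immintrin.h>
#include <stdint.h>

// Sketch only: build the 256-entry pshufb mask table plus, per pattern,
// the number of bytes the 16-byte window actually covers.
static void buildMaskTable(__m128i* masks, uint32_t* patternLengths)
{
    static const int sizeOf[4] = { 1, 2, 4, 8 };
    for (int pattern = 0; pattern < 256; pattern++) {
        uint8_t m[16];
        for (int k = 0; k < 16; k++)
            m[k] = (uint8_t)k;                       // default: leave the byte where it is
        int pos = 0;
        for (int e = 0; e < 4; e++) {
            int size = sizeOf[(pattern >> (2 * e)) & 3];
            if (pos + size > 16)
                break;                               // element not wholly contained in the window
            for (int b = 0; b < size; b++)           // reverse this element's bytes
                m[pos + b] = (uint8_t)(pos + size - 1 - b);
            pos += size;
        }
        masks[pattern] = _mm_loadu_si128((__m128i const*)m);
        patternLengths[pattern] = (uint32_t)pos;     // bytes processed when this pattern is used
    }
}

The per-iteration lengths passed to the flipper further down could then simply be patternLengths[sizePatterns[j]], precomputed alongside the size patterns.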

For maximum convenience, those size patterns (as well as the corresponding byte counts) should be supplied in such a way that the actual Endianness Flipper itself can consume exactly one of them per iteration, without anything heavy such as extracting a byte-unaligned sequence of 8 bits and determining dynamically how many bits to consume. That's also possible, but at a significantly higher cost: about 4x as slow in my test, limited by the loop-carried dependency from "extract 8 bits at the current bit-index" through "find the bit-index increment by table lookup" and into the next iteration, at about 16 cycles per iteration, though still in 60% of the time that equivalent scalar code took.

Using an unpacked (1 byte per size) representation would make the extraction easier (just an unaligned dword load), but it requires packing the bits back down to index the shuffle-mask table, for example with pext. That would be reasonable for Intel CPUs, but pext is extremely slow on AMD Ryzen. An alternative that is fine for both AMD and Intel would be to do the unaligned dword read, then extract the 8 interesting bits using a multiply/shift trick:

mov eax, [rdi]              ; unaligned dword load: four size codes, one in the low 2 bits of each byte
imul eax, eax, 0x01041040   ; the multiply gathers those 2-bit fields into bits 24..31
shr eax, 24                 ; packed 8-bit size pattern ends up in al
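
For reference, a hedged C equivalent of that extraction, next to the pext variant mentioned above (assuming the size codes sit in the low 2 bits of each byte):

#include <stdint.h>
#include <string.h>
#include <immintrin.h>   // _pext_u32 (BMI2)

// Sketch only: pack four per-byte 2-bit size codes into one 8-bit pattern.
static uint32_t packSizesMul(const uint8_t* p)
{
    uint32_t dword;
    memcpy(&dword, p, 4);                  // unaligned dword load
    return (dword * 0x01041040u) >> 24;    // multiply/shift trick, cheap on both Intel and AMD
}

static uint32_t packSizesPext(const uint8_t* p)
{
    uint32_t dword;
    memcpy(&dword, p, 4);
    return _pext_u32(dword, 0x03030303u);  // fine on Intel, very slow on Ryzen as noted above
}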

An extra trick that should be used, in the Convenient Input case at least (otherwise we're stuck with 5x worse performance anyway and this trick won't be relevant), is reading the data for the next iteration before storing the result of the current iteration. Without that trick, the store will often "step on the toes" of the load of the next iteration (because we advance less than 16 bytes, so the load reads some of the bytes that the store left unchanged but had to write anyway), forcing a memory dependency between them which holds up the next iteration. The performance difference is large, about 3x.

Then the Endianness Flipper could look something like this:

#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>   // _mm_shuffle_epi8 (SSSE3) and the unaligned load/store intrinsics

void flipEndiannessSSSE3(char* buffer, size_t totalLength, uint8_t* sizePatterns, uint32_t* lengths, __m128i* masks)
{
    size_t i = 0;   // current byte position in buffer
    size_t j = 0;   // current index into sizePatterns/lengths
    __m128i data = _mm_loadu_si128((__m128i*)buffer);
    while (i < totalLength) {
        int sizepattern = sizePatterns[j];
        __m128i permuted = _mm_shuffle_epi8(data, masks[sizepattern]);
        size_t next_i = i + lengths[j++];
        // load the next window before storing the current one, so the store
        // does not create a memory dependency into the next iteration's load
        data = _mm_loadu_si128((__m128i*)&buffer[next_i]);
        _mm_storeu_si128((__m128i*)&buffer[i], permuted);
        i = next_i;
    }
}

For example, Clang 10 with -O3 -march=haswell turns that into

    test    rsi, rsi
    je      .LBB0_3
    vmovdqu xmm0, xmmword ptr [rdi]
    xor     r9d, r9d
    xor     r10d, r10d
.LBB0_2:                            # =>This Inner Loop Header: Depth=1
    movzx   eax, byte ptr [rdx + r10]
    shl     rax, 4
    vpshufb xmm1, xmm0, xmmword ptr [r8 + rax]
    mov     eax, dword ptr [rcx + 4*r10]
    inc     r10
    add     rax, r9
    vmovdqu xmm0, xmmword ptr [rdi + rax]
    vmovdqu xmmword ptr [rdi + r9], xmm1
    mov     r9, rax
    cmp     rax, rsi
    jb      .LBB0_2
.LBB0_3:
    ret

LLVM-MCA thinks that takes about 3.3 cycles per iteration; on my PC (4770K, tested with a uniform mix of 1, 2, 4 and 8 byte sized elements) it was a little slower, closer to 3.7 cycles per iteration, but that's still good: just under 1.2 cycles per element.

answered Nov 14 '22 by harold