How does endianness work with SIMD registers?

Question

I'm working with integers and SSE and have become very confused about how endianness affects moving data in and out of registers.

My initial, wrong, understanding

Initially my understanding was as follows. If I have an array of 4-byte integers, the memory would be laid out as follows, since x86 architectures are little-endian:

0D 0C 0B 0A 1D 1C 1B 1A 2D 2C 2B 2A .... nD nC nB nA

Where the letters A, B, C and D index the bytes within an integer element, and numbers index the element.

In an XMM register, my understanding is that four integers would be laid out as follows:

0A 0B 0C 0D 1A 1B 1C 1D 2A 2B 2C 2D 3A 3B 3C 3D

However, I'm pretty sure this picture is wrong for several reasons. The first is the documentation for the _mm_load_si128 intrinsic, which is supposed to work for any integer data but, in the above picture, could only work for one element size. Similarly there is this (archived) thread. Finally, I see people writing code like the following:

__declspec(align(16)) int32_t A[N];
__m128i* As = (__m128i*)A;

A potentially correct picture

The Wikipedia article on endianness says I should think of memory addresses increasing right to left. How about the following picture for memory then?

nA nB nC nD ... 2A 2B 2C 2D 1A 1B 1C 1D 0A 0B 0C 0D

And then in a register:

3A 3B 3C 3D 2A 2B 2C 2D 1A 1B 1C 1D 0A 0B 0C 0D

Z boson · Accepted Answer

It's just a question of interpretation. We read/write the digits of a number from left to right, highest digit to lowest digit. So for a 32-bit number with highest byte A, then B, then C, and lowest byte D, we would read/write ABCD. We do the same when notating a 128-bit integer:

3A3B3C3D 2A2B2C2D 1A1B1C1D 0A0B0C0D

But a little-endian system reads and writes the bytes from the lowest address to the highest, like this:

0D0C0B0A 1D1C1B1A 2D2C2B2A 3D3C3B3A

For 16-bit integers it's the same logic. We could read/write it as

7A7B 6A6B 5A5B 4A4B 3A3B 2A2B 1A1B 0A0B

and the little-endian computer reads/stores it from lowest to highest address as

0B0A 1B1A 2B2A 3B3A 4B4A 5B5A 6B6A 7B7A

That's why the same instructions read/write 32-bit, 16-bit and 8-bit integers into a 128-bit register: movdqa and movaps (or the unaligned variants movdqu and movups).
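To make this concrete, here is a minimal sketch in C with SSE2 intrinsics. The helper names low_dword_after_load and low_word_after_load are made up for illustration. Both perform the same 16-byte unaligned load (movdqu) and only differ in how they read the register afterwards.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Load 16 bytes from p and return the lowest 32-bit element of the register.
   (Hypothetical helper name, for illustration only.) */
static int32_t low_dword_after_load(const void *p) {
    __m128i v = _mm_loadu_si128((const __m128i *)p);  /* movdqu */
    return _mm_cvtsi128_si32(v);                      /* movd: bits 31..0 */
}

/* Exactly the same load, but return the lowest 16-bit element instead. */
static int low_word_after_load(const void *p) {
    __m128i v = _mm_loadu_si128((const __m128i *)p);  /* movdqu */
    return _mm_extract_epi16(v, 0);                   /* pextrw: bits 15..0, zero-extended */
}
```

With int32_t a[4] = {0x0A0B0C0D, 0x1A1B1C1D, 0x2A2B2C2D, 0x3A3B3C3D}, the first helper returns 0x0A0B0C0D (element 0, the one at the lowest address) and the second returns 0x0C0D, its low half. The load itself never needs to know the element size.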

Glenn Slayden · Answer

Elaborating on Peter's comment, there is most certainly an implicit endianness inherent to the SIMD implementation, owing to the fact that there are instructions that can:

Read and write 128-bit SIMD registers from memory. Since memory must always be accessed by byte offset (regardless of the instruction or how many bytes it stores or fetches), and can subsequently be examined by other non-SIMD means, the movaps, movdqa, movdqu, etc. instructions inherently imply an endianness.
Index vector elements with instructions like pshufd and even runtime-variable indexing with pshufb that use an integer index to select elements. This means that elements have something like addresses, and wide elements contain multiple independently-addressable narrow elements. (Not part of memory address space, of course, but unlike scalar registers we have a 2nd way to talk about position other than left/right shift within a wide element. This is the same thing that makes endianness an issue for memory.) The indexing of elements within a register is chosen to match the order in memory (little-endian), but it could have been different.
Shift bits across byte boundaries of a SIMD register with pslld, psrld, or whole-vector byte-shifts like pslldq etc. Note that crossing "byte boundaries" includes within the individual word, dword, or qword components, because (for the same reason noted in the previous point) the register can subsequently be imaged to memory. A byte-shift of a whole vector groups the low byte of one word with the high byte of the adjacent word, in a way that depends on endianness.
Re-interpret the component size (byte, word, dword, or qword) of an existing SIMD register's contents. This is the analogue of reading the bytes of a dword in memory: they have an order. Shuffling around qwords using pshufd requires you to consider the endianness when choosing the shuffle control, to keep the right high:low pairs of dwords grouped in the right order.
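Two of the points above (element indexing and whole-vector byte shifts) can be observed with a short sketch in C using SSE2 intrinsics. The function names are hypothetical and chosen only for this illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stdint.h>

/* Byte-shift the whole vector left by one byte (pslldq) and store it back,
   so the effect can be inspected byte by byte in memory. */
static void shift_left_one_byte(const uint8_t in[16], uint8_t out[16]) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    v = _mm_slli_si128(v, 1);               /* pslldq xmm, 1 */
    _mm_storeu_si128((__m128i *)out, v);
}

/* Use pshufd to copy dword index 3 into every position, then return
   the lowest dword. Index 3 is the dword loaded from the highest address. */
static int32_t select_dword3(const int32_t in[4]) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 3, 3));  /* pshufd */
    return _mm_cvtsi128_si32(v);
}
```

A "left" shift moves each byte to the next higher address: out[0] becomes 0 and out[i] equals in[i - 1]. Likewise select_dword3 returns in[3], which shows that the register's element indices follow memory order.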

So it's true that if you never do any of these things (meaning you exclusively use SIMD memory images with SIMD registers, with matching component sizes, never examine that memory otherwise, and also maintain consistent component sizes in operations on those SIMD registers), then you don't have to worry about SIMD endianness. Otherwise, read on...

Knowing now that the SIMD operations listed above expose an endianness, what then is it? Well, we already know that Intel architecture is little-endian, meaning word, dword, and qword (respectively, 16-, 32-, and 64-bit memory accesses) are recursively swapped. For example, storing a single qword swaps the stores of its two dwords, each of which swaps its two words, each of which swaps its two bytes. This results in the memory image of a CPU register having a reversed byte order overall.
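As a quick check of the scalar case, the following C sketch copies a 64-bit value into a byte array so its memory image can be inspected. The function name qword_memory_image is made up for this example, and the result described below assumes a little-endian machine such as x86.

```c
#include <stdint.h>
#include <string.h>

/* Copy the in-memory representation of a 64-bit value into out[0..7].
   out[0] is the byte at the lowest address. */
static void qword_memory_image(uint64_t value, uint8_t out[8]) {
    memcpy(out, &value, sizeof value);
}
```

For value = 0x0102030405060708, a little-endian machine gives out[0] == 0x08 and out[7] == 0x01: the least-significant byte sits at the lowest address, so the bytes appear reversed relative to how we write the number.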

For compatibility with non-SIMD instructions operating on the same size, the memory image of each individual component of a SIMD register should be bit-identical, for every component size, to the existing (little-endian) format. The prior existence of non-SIMD instructions for word, dword, and qword accesses thus represents a hard constraint, and those SIMD components must manifest little-endian images.

But there are no prior non-SIMD instructions for 128-bit memory access, so there isn't a prior constraint on the (qword, qword) layout of the dqword SIMD register itself. That leaves really just the one possible question we could be asking here: does the recursive little-endian swapping pattern (word, dword, qword, ...?) continue, applying to dqword values as well? In other words, in the 16-byte memory image of a SIMD register, which of its two qword components (the numerically less-significant or the more-significant) is stored in the lower-addressed 8 bytes?

ANSWER: The least-significant qword is stored at the lower-addressed 8 bytes.

This preserves the symmetry of the "little-endian" recursive swapping, by extending the pattern to include dqword values as well. To summarize, a 128-bit SIMD register is little-endian, because its memory image at [esi] has:

the less-significant qword (SIMD index 0) at the lower address qword ptr [esi],
the more-significant qword (SIMD index 1) at the higher address qword ptr [esi + 8].
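This layout can be confirmed with a small C sketch using SSE2 intrinsics. The function name store_qword_pair is hypothetical; _mm_set_epi64x takes the more-significant qword first and the less-significant qword second.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Build a 128-bit register from two qwords and store it to memory.
   After the call, out[0] is the qword at the lower address and
   out[1] is the qword at the higher address (offset 8). */
static void store_qword_pair(uint64_t hi, uint64_t lo, uint64_t out[2]) {
    __m128i v = _mm_set_epi64x((long long)hi, (long long)lo);
    _mm_storeu_si128((__m128i *)out, v);    /* movdqu */
}
```

After store_qword_pair(hi, lo, out), out[0] holds lo and out[1] holds hi, matching the two bullet points above.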
