Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

why 128bit variables should be aligned to 16Byte boundary

As we know, X86 CPU has a 64bit data bus. My understanding is that CPU can't access to arbitrary address. The address that CPU could access to is a integral multiple of the width of its data bus. For the performance, variables should start at(aligned to) these addresses to avoid extra memory access. 32bit variables aligned to 4Byte boundry will be automatically aligned to 8Byte(64bit) boundry, which corresponds to x86 64bit data bus. But why compilers align 128bit variables to 16Byte boundry? Not the 8Byte boundry?

Thanks

Let me make things more specific. Compilers use the length of a variable to align it. For example, if a variable has 256bit length, Complier will align it to 32Byte boundry. I don't think there is any kind of CPU has that long data-bus. Furthermore, common DDR memories only transfer 64bit data one time, despite of the cache, how could a memory fill up CPU's wider data-bus? or only by means of cache?

like image 823
iqapple Avatar asked May 22 '13 23:05

iqapple


People also ask

What does it mean to be 16 byte aligned?

This effectively means that the address of the memory your data resides in needs to be divisible by the number of bytes required by the instruction. So in your case the alignment is 16 bytes (128 bits), which means the memory address of your data needs to be a multiple of 16.

Why does data need to be aligned?

Data alignment is a prerequisite for Data integration. Two datasets from different sources can only be integrated if the Variables, and the values of those Variables, can be matched between both datasets.

What does it mean to be byte aligned?

"X bytes aligned" means that the base address of your data must be a multiple of X. It can be used for using some special hardware like a DMA in some special hardware, for a faster access by the cpu, etc...

Is malloc 16 byte aligned?

The GNU documentation states that malloc is aligned to 16 byte multiples on 64 bit systems.


2 Answers

One reasons is that most SSE2 instructions on X86 require the data to be 128 bit aligned. This design decision would have been made for performance reasons and to avoid overly complex (and hence slow and big) hardware.

like image 115
Bull Avatar answered Sep 30 '22 05:09

Bull


There are so many different processor models that I am going to answer this only in theoretical and general terms.

Consider an array of 16-byte objects that starts at an address that is a multiple of eight bytes but not of 16 bytes. Let’s suppose the processor has an eight-byte bus, as indicated in the question, even if some processors do not. However, note that at some point in the array, one of the objects must straddle a page boundary: Memory mapping commonly works in 4096-byte pages that start on 4096-byte boundaries. With an eight-byte-aligned array, some element of the array will start at byte 4088 of one page and continue up to byte 7 of the next page.

When a program tries to load the 16-byte object that crosses a page boundary, it can no longer do a single virtual-to-physical memory map. It has to do one lookup for the first eight bytes and another lookup for the second eight bytes. If the load/store unit is not designed for this, then the instruction needs special handling. The processor might abort its initial attempt to execute the instruction, divide it into two special microinstructions, and send those back into the instruction queue for execution. This can delay the instruction by many processor cycles.

In addition, as Hans Passant noted, alignment interacts with cache. Each processor has a memory cache, and it is common for cache to be organized into 32-byte or 64-byte “lines”. If you load a 16-byte object that is 16-byte aligned, and the object is in cache, then the cache can supply one cache line that contains the needed data. If you are loading 16-byte objects from an array that is not 16-byte aligned, then some of the objects in the array will straddle two cache lines. When these objects are loaded, two lines must be fetched from the cache. This may take longer. Even if it does not take longer to get two lines, perhaps because the processor is designed to provide two cache lines per cycle, this can interfere with other things that a program is doing. Commonly, a program will load data from multiple places. If the loads are efficient, the processor may be able to perform two at once. But if one of them requires two cache lines instead of the normal one, then it blocks simultaneous execution of other load operations.

Additionally, some instructions explicitly require aligned addresses. The processor might dispatch these instructions more directly, bypassing some of the tests that fix up operations without aligned addresses. When the addresses of these instructions are resolved and are found to be misaligned, the processor must abort them, because the fix-up operations have been bypassed.

like image 26
Eric Postpischil Avatar answered Sep 30 '22 03:09

Eric Postpischil