Why should 128-bit variables be aligned to a 16-byte boundary?

As we know, x86 CPUs have a 64-bit data bus. My understanding is that the CPU cannot access arbitrary addresses: the addresses it can access are integral multiples of the width of its data bus. For performance, variables should start at (be aligned to) these addresses to avoid extra memory accesses. A 32-bit variable aligned to a 4-byte boundary always falls within a single 8-byte (64-bit) unit, which matches the x86 64-bit data bus. But why do compilers align 128-bit variables to a 16-byte boundary, and not to an 8-byte boundary?

Thanks

Let me make this more specific. Compilers use the size of a variable to align it. For example, if a variable is 256 bits long, the compiler will align it to a 32-byte boundary. I don't think any CPU has a data bus that wide. Furthermore, common DDR memory transfers only 64 bits at a time. Leaving the cache aside, how could memory fill a wider CPU data bus? Or is that possible only by means of the cache?

asked May 22 '13 23:05

iqapple

2 Answers

One reason is that most SSE2 instructions on x86 require the data to be 128-bit (16-byte) aligned. This design decision was presumably made for performance reasons and to avoid overly complex (and hence slow and big) hardware.
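
As a rough illustration of what that requirement means in source code, here is a minimal C++ sketch. The type `Vec4f` and the helper `is_aligned` are hypothetical names made up for this example, and the sketch assumes a platform where `float` is 4 bytes. It only shows how a 16-byte alignment request and a runtime alignment check can look; it does not issue any SSE2 instructions itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 128-bit type. alignas(16) asks the compiler to place every
// object of this type at an address that is a multiple of 16.
struct alignas(16) Vec4f {
    float lanes[4];
};

static_assert(sizeof(Vec4f) == 16, "assumes 4-byte float: four lanes occupy 128 bits");
static_assert(alignof(Vec4f) == 16, "the type carries a 16-byte alignment requirement");

// Returns true when the address in p is a multiple of `alignment`.
// `alignment` must be a nonzero power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

A check like `is_aligned(ptr, 16)` is the kind of test code could make before choosing between an aligned and an unaligned load of the same data.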

answered Sep 30 '22 05:09

Bull

There are so many different processor models that I am going to answer this only in theoretical and general terms.

Consider an array of 16-byte objects that starts at an address that is a multiple of eight bytes but not of 16 bytes. Let’s suppose the processor has an eight-byte bus, as indicated in the question, even if some processors do not. Note that at some point in a sufficiently long array, one of the objects must straddle a page boundary: memory mapping commonly works in 4096-byte pages that start on 4096-byte boundaries. With an eight-byte-aligned array, some element of the array will start at byte 4088 of one page and continue up to byte 7 of the next page.
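
The arithmetic behind that claim can be sketched in a few lines of C++. The names `kPageSize`, `crosses_page`, and `first_crossing` are made up for this illustration, and addresses are treated as plain integers under the 4096-byte page size assumed above.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kPageSize = 4096;  // page size assumed in the text

// True when the byte range [addr, addr + size) touches more than one page.
constexpr bool crosses_page(std::uint64_t addr, std::uint64_t size) {
    return addr / kPageSize != (addr + size - 1) / kPageSize;
}

// Index of the first element that crosses a page boundary in an array of
// `count` objects of `size` bytes starting at `base`, or -1 if none does.
inline long first_crossing(std::uint64_t base, std::uint64_t size, long count) {
    assert(size > 0);
    for (long i = 0; i < count; ++i) {
        if (crosses_page(base + static_cast<std::uint64_t>(i) * size, size)) {
            return i;
        }
    }
    return -1;
}
```

With a base offset of 8 and 16-byte elements, `first_crossing(8, 16, 1024)` finds element 255, which starts at byte 8 + 255 * 16 = 4088. With a base offset of 0, no element crosses a page boundary, because 16 divides 4096.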

When a program tries to load the 16-byte object that crosses a page boundary, it can no longer do a single virtual-to-physical memory map. It has to do one lookup for the first eight bytes and another lookup for the second eight bytes. If the load/store unit is not designed for this, then the instruction needs special handling. The processor might abort its initial attempt to execute the instruction, divide it into two special microinstructions, and send those back into the instruction queue for execution. This can delay the instruction by many processor cycles.

In addition, as Hans Passant noted, alignment interacts with cache. Each processor has a memory cache, and it is common for cache to be organized into 32-byte or 64-byte “lines”. If you load a 16-byte object that is 16-byte aligned, and the object is in cache, then the cache can supply one cache line that contains the needed data. If you are loading 16-byte objects from an array that is not 16-byte aligned, then some of the objects in the array will straddle two cache lines. When these objects are loaded, two lines must be fetched from the cache. This may take longer. Even if it does not take longer to get two lines, perhaps because the processor is designed to provide two cache lines per cycle, this can interfere with other things that a program is doing. Commonly, a program will load data from multiple places. If the loads are efficient, the processor may be able to perform two at once. But if one of them requires two cache lines instead of the normal one, then it blocks simultaneous execution of other load operations.
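
To get a feel for how often this happens, the following C++ sketch counts the elements of an array that span more than one cache line. The function names are hypothetical and the line size is a parameter, since it differs between processors. The sketch only does address arithmetic and says nothing about actual timing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of cache lines touched by the byte range [addr, addr + size).
inline std::uint64_t lines_touched(std::uint64_t addr, std::uint64_t size,
                                   std::uint64_t line_size) {
    assert(size > 0 && line_size > 0);
    return (addr + size - 1) / line_size - addr / line_size + 1;
}

// Counts how many of `count` consecutive `size`-byte elements starting at
// `base` span two or more cache lines.
inline std::size_t count_split_elements(std::uint64_t base, std::uint64_t size,
                                        std::size_t count, std::uint64_t line_size) {
    std::size_t split = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (lines_touched(base + i * size, size, line_size) > 1) {
            ++split;
        }
    }
    return split;
}
```

For 64 elements of 16 bytes and 64-byte lines, `count_split_elements(0, 16, 64, 64)` gives 0, while `count_split_elements(8, 16, 64, 64)` gives 16: with the array offset by eight bytes, every fourth element needs two lines.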

Additionally, some instructions explicitly require aligned addresses. The processor might dispatch these instructions more directly, bypassing some of the tests that fix up operations without aligned addresses. When the addresses of these instructions are resolved and are found to be misaligned, the processor must abort them, because the fix-up operations have been bypassed.

answered Sep 30 '22 03:09

Eric Postpischil

Tags: c++, c, memory-management, x86, assembly
