Many guides to low latency development discuss aligning memory allocations on particular address boundaries:
https://github.com/real-logic/simple-binary-encoding/wiki/Design-Principles#word-aligned-access
http://www.alexonlinux.com/aligned-vs-unaligned-memory-access
However, the second link is from 2008. Does aligning memory on address boundaries still provide a performance improvement on Intel CPUs in 2019? I thought Intel CPUs no longer incur a latency penalty for accessing unaligned addresses? If that's not the case, under what circumstances should this be done? Should I align every stack variable? Every class member variable?
Does anybody have any examples where they have found a significant performance improvement from aligning memory?
Align arrays. SIMD register-size aligned data accesses are performed much faster by the processor than unaligned ones. In some cases the compiler and/or hardware can minimize the performance impact, but significant performance increases, especially for vector code, can often be achieved by ensuring alignment.
The CPU can operate on an aligned word of memory atomically, meaning that no other core can observe the load or store half-done. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.
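As a small illustration of that point (this snippet is mine, not from the linked guides): on x86-64 a naturally aligned 8-byte object never straddles a cache line, which is why std::atomic on such a type can be lock-free.

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>

int main() {
    // A naturally aligned 8-byte object never straddles a cache line, so the
    // CPU can load/store it in one atomic operation; std::atomic relies on this.
    std::atomic<std::uint64_t> counter{0};
    std::printf("alignment = %zu, lock-free = %d\n",
                alignof(std::atomic<std::uint64_t>),
                static_cast<int>(counter.is_lock_free()));
    counter.fetch_add(1, std::memory_order_relaxed);
    return 0;
}
```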
What is alignment? Alignment refers to the arrangement of data in memory, and specifically to accessing data in naturally sized units from main memory: a value is aligned when its address is a multiple of its size. Conceptually, main memory is a contiguous block of consecutive locations, each holding a fixed number of bits.
The penalties are usually small, but crossing a 4k page boundary on Intel CPUs before Skylake has a large penalty (~150 cycles). How can I accurately benchmark unaligned access speed on x86_64 has some details on the actual effects of crossing a cache-line boundary or a 4k boundary. (This applies even if the load / store is inside one 2M or 1G hugepage, because the hardware can't know that until after it's started the process of checking the TLB twice.) E.g. in an array of double that was only 4-byte aligned, at a page boundary there'd be one double split evenly across two 4k pages, and the same at every cache-line boundary.
Regular cache-line splits that don't cross a 4k page cost ~6 extra cycles of latency on Intel (total of 11c on Skylake, vs. 4 or 5c for a normal L1d hit), and cost extra throughput (which can matter in code that normally sustains close to 2 loads per clock.)
Misalignment without crossing a 64-byte cache-line boundary has zero penalty on Intel. On AMD, cache lines are still 64 bytes, but there are relevant boundaries within cache lines at 32 bytes and maybe 16 on some CPUs.
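For intuition about those split penalties, here is a small sketch (the helper name is mine): an access straddles a cache line or page exactly when its first and last bytes fall in different 64-byte or 4096-byte blocks.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Does an access of `size` bytes starting at `addr` cross a boundary of
// `boundary` bytes (e.g. 64 for a cache line, 4096 for a page)?
bool crosses_boundary(std::uintptr_t addr, std::size_t size, std::size_t boundary) {
    return (addr / boundary) != ((addr + size - 1) / boundary);
}

int main() {
    std::printf("%d\n", crosses_boundary(4092, 8, 4096)); // 8-byte load at page offset 4092: page split
    std::printf("%d\n", crosses_boundary(60, 8, 64));      // cache-line split
    std::printf("%d\n", crosses_boundary(56, 8, 64));      // fits within one line
    return 0;
}
```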
Should I align every stack variable?
No, the compiler already does that for you. x86-64 calling conventions maintain a 16-byte stack alignment, so they can get any alignment up to that for free, including for 8-byte int64_t and double arrays.
Also remember that most local variables are kept in registers for most of the time they're getting heavy use. Unless a variable is volatile, or you compile without optimization, the value doesn't have to be stored / reloaded between accesses.
The normal ABIs also require natural alignment (aligned to its size) for all the primitive types, so even inside structs and so on you will get alignment, and a single primitive type will never span a cache-line boundary. (Exception: i386 System V only requires 4-byte alignment for int64_t and double. Outside of structs, the compiler will choose to give them more alignment, but inside structs it can't change the layout rules. So declare your structs in an order that puts the 8-byte members first, or at least laid out so they get 8-byte alignment. Maybe use alignas(8) on such struct members if you care about 32-bit code, if there aren't already any members that require that much alignment.)
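A hypothetical sketch of that layout advice (the struct and field names are made up for illustration):

```cpp
#include <cstdint>

// Put the 8-byte members first so they keep natural alignment inside the struct.
struct Order {
    double       price;
    std::int64_t quantity;
    std::int32_t id;
    std::int16_t flags;
};

// If you can't reorder, alignas(8) on the member forces 8-byte alignment even
// on ABIs (like i386 System V) that would otherwise allow 4-byte alignment.
struct Legacy {
    std::int32_t            id;
    alignas(8) std::int64_t timestamp;  // padded so it starts on an 8-byte boundary
};

static_assert(alignof(Legacy) >= 8, "the struct inherits the member's alignment");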
The x86-64 System V ABI (all non-Windows platforms) requires aligning arrays by 16 if they have automatic or static storage outside of a struct. alignof(max_align_t) is 16 on x86-64 SysV, so malloc / new return 16-byte aligned memory for dynamic allocation. gcc targeting Windows also aligns stack arrays if it auto-vectorizes over them in that function.
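A quick way to check this on your own platform (a sketch, not from the original answer):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main() {
    // On x86-64 System V this prints 16: malloc/new must return memory aligned
    // enough for any fundamental type, including 16-byte vector types.
    std::printf("alignof(max_align_t) = %zu\n", alignof(std::max_align_t));

    void* p = std::malloc(1024);
    std::printf("malloc result aligned to max_align_t: %d\n",
                static_cast<int>(reinterpret_cast<std::uintptr_t>(p) %
                                 alignof(std::max_align_t) == 0));
    std::free(p);
    return 0;
}
```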
(If you cause undefined behaviour by violating the ABI's alignment requirements, it often doesn't make any performance difference. It usually doesn't cause correctness problems on x86, but it can lead to faults for SIMD types, and with auto-vectorization of scalar types. e.g. Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?. So if you intentionally misalign data, make sure you don't access it with any pointer wider than char*. e.g. use memcpy(&tmp, buf, 8) with uint64_t tmp to do an unaligned load. gcc can autovectorize through that, IIRC.)
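Expanded into a complete helper (a sketch; the function name is mine), the memcpy idiom looks like this. Compilers lower the memcpy to a single unaligned load instruction, and it stays strict-aliasing-safe:

```cpp
#include <cstdint>
#include <cstring>

// Unaligned 64-bit load without dereferencing a misaligned uint64_t*.
std::uint64_t load_u64_unaligned(const void* buf) {
    std::uint64_t tmp;
    std::memcpy(&tmp, buf, sizeof(tmp));  // compiles to one mov on x86-64
    return tmp;
}
```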
You might sometimes want alignas(32) or 64 for large arrays, if you compile with AVX or AVX512 enabled. For a SIMD loop over a big array (one that doesn't fit in L2 or L1d cache), with AVX/AVX2 (32-byte vectors) there's usually near-zero effect from making sure it's aligned by 32 on Intel Haswell/Skylake. Memory bottlenecks in data coming from L3 or DRAM will give the core's load/store units and L1d cache time to do multiple accesses under the hood, even if every other load/store crosses a cache-line boundary.
But with AVX512 on Skylake-server, there is a significant effect in practice for 64-byte alignment of arrays, even with arrays that are coming from L3 cache or maybe DRAM. I forget the details, it's been a while since I looked at an example, but maybe 10 to 15% even for a memory-bound loop? Every 64-byte vector load and store will cross a 64-byte cache line boundary if they aren't aligned.
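A few ways to get that 64-byte alignment in practice (a sketch assuming C++17; std::aligned_alloc may not be available on MSVC):

```cpp
#include <cstdlib>
#include <new>

// A type whose alignment is one full cache line.
struct alignas(64) Block { float data[16]; };

int main() {
    alignas(64) static float table[1 << 16];   // static array, 64-byte aligned
    (void)table;

    // C++17 std::aligned_alloc: the size should be a multiple of the alignment.
    float* a = static_cast<float*>(std::aligned_alloc(64, (1 << 20) * sizeof(float)));
    std::free(a);

    Block* b = new Block[1024];   // C++17 operator new honours alignas(64)
    delete[] b;
    return 0;
}
```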
Depending on the loop, you can handle under-aligned inputs by doing a first maybe-unaligned vector, then looping over aligned vectors until the last aligned vector. Another possibly-overlapping vector that goes to the end of the array can handle the last few bytes. This works great for a copy-and-process loop where it's ok to re-copy and re-process the same elements in the overlap, but there are other techniques you can use for other cases, e.g. a scalar loop up to an alignment boundary, narrower vectors, or masking. If your compiler is auto-vectorizing, it's up to the compiler to choose. If you're manually vectorizing with intrinsics, you get to / have to choose. If arrays are normally aligned, it's a good idea to just use unaligned loads (which have no penalty if the pointers are aligned at runtime), and let the hardware handle the rare cases of unaligned inputs so you don't have any software overhead on aligned inputs.
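As a concrete (hypothetical) sketch of that first/middle/last-vector pattern with AVX intrinsics, assuming non-overlapping buffers and at least 8 elements, so re-processing the overlapped head and tail elements just rewrites the same values:

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Copy-and-process loop (here: scale by 2). Compile with AVX enabled.
void scale2(float* dst, const float* src, std::size_t n) {
    const __m256 two = _mm256_set1_ps(2.0f);

    // First, possibly-unaligned vector covering elements [0, 8).
    _mm256_storeu_ps(dst, _mm256_mul_ps(_mm256_loadu_ps(src), two));

    // Advance i to the next 32-byte boundary of dst, then use aligned stores.
    std::size_t i = (32 - (reinterpret_cast<std::uintptr_t>(dst) & 31)) / sizeof(float);
    for (; i + 8 <= n; i += 8)
        _mm256_store_ps(dst + i, _mm256_mul_ps(_mm256_loadu_ps(src + i), two));

    // Final, possibly-overlapping vector ending exactly at the last element.
    _mm256_storeu_ps(dst + n - 8, _mm256_mul_ps(_mm256_loadu_ps(src + n - 8), two));
}
```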