An older answer indicates that aarch64 supports unaligned reads/writes and has a mention about performance cost, but it's unclear if the answer covers only the ALU or SIMD (128-bit register) operations, too.
Relative to aligned 128-bit NEON loads and stores, how much slower (if at all) are unaligned 128-bit NEON loads and stores on aarch64?
Are there separate instructions for unaligned SIMD loads and stores (as is the case with SSE2) or are the known-aligned loads/stores the same instructions as potentially-unaligned loads/stores?
According to the Cortex-A57 Software Optimization Guide in section 4.6 Load/Store Alignment it says:
The ARMv8-A architecture allows many types of load and store accesses to be arbitrarily aligned. The Cortex-A57 processor handles most unaligned accesses without performance penalties. However, there are cases which reduce bandwidth or incur additional latency, as described below:
- Load operations that cross a cache-line (64-byte) boundary
- Store operations that cross a 16-byte boundary
So it may depend on the processor that you are using, out of order (A57, A72, A-72, A-75) or in-order (A-35, A-53, A-55). I didn't find any optimization guide for the in-order processors, however they do have a Hardware Performance Counter that you could use to check if the number of unaligned instructions do affect performance:
0xOF_UNALIGNED_LDST_RETIRED Unaligned load-store
This can be used with the perf
tool.
There are no special instructions for unaligned accesses in AArch64.
If a load/store has to be split or crosses a cache line, at least one extra cycle is required.
There are exhaustive tables that specify the number of cycles required for various alignments and numbers of registers for the Cortex-A8 (in-order) and Cortex-A9 (partially OoO). For example, vld1
with one reg has a 1-cycle penalty for unaligned access vs. 64 bit-aligned access.
The Cortex-A55 (in-order) does up to 64-bit loads and 128-bit stores, and accordingly, section 3.3 of its optimization manual states a 1-cycle penalty is incurred for:
• Load operations that cross a 64-bit boundary
• 128-bit store operations that cross a 128-bit boundary
The Cortex-A75 (OoO) has penalties per section 5.4 of its optimization guide for:
• Load operations that cross a 64-bit boundary.
• In AArch64, all stores that cross a 128-bit boundary.
• In AArch32, all stores that cross a 64-bit boundary.
And as in Guillermo's answer, the A57 (OoO) has penalties for:
• Load operations that cross a cache-line (64-byte) boundary
• Store operations that cross a [128-bit] boundary
I'm somewhat skeptical that the A57 does not have a penalty for crossing 64-bit boundaries given that the A55 and A75 do. All of these have 64-byte cache lines; they should all have penalties for crossing cache lines too. Finally, note that there's unpredictable behavior for split access crossing pages.
From some crude testing (without perf counters) with a Cavium ThunderX, there seems to be closer to a 2-cycle penalty, but that might be an additive effect of having back-to-back unaligned loads and stores in a loop.
AArch64 NEON instructions don't distinguish between aligned and un-aligned (see LD1 for example). For AArch32 NEON, alignment is specified statically in the addressing (VLDn):
vld1.32 {d16-d17}, [r0] ; no alignment
vld1.32 {d16-d17}, [r0@64] ; 64-bit aligned
vld1.32 {d16-d17}, [r0:64] ; 64 bit-aligned, used by GAS to avoid comment ambiguity
I don't know if aligned access without alignment qualifiers performs any slower than access with alignment qualifiers on recent chips running in AArch32 mode. Somewhat old documentation from ARM encourages using qualifiers whenever possible. (Intel refined their chips such that unaligned and aligned moves perform the same when the address is aligned, by comparison.)
If you're using intrinsics, MSVC has _ex
-suffixed variants that accept the alignment. A reliable way to get GCC to emit an alignment qualifier is with __builtin_assume_aligned
.
// MSVC
vld1q_u16_ex(addr, 64);
// GCC:
addr = (uint16_t*)__builtin_assume_aligned(addr, 8);
vld1q_u16(addr);
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With