I always hear that unaligned accesses are bad because they will either cause runtime errors and crash the program or slow memory accesses down. However I can't find any actual data on how much they will slow things down.
Suppose I'm on x86 and have some (yet unknown) share of unaligned accesses - what's the worst slowdown actually possible and how do I estimate it without eliminating all unaligned accesses and comparing run time of two versions of code?
Unaligned memory accesses occur when you try to read N bytes of data starting from an address that is not evenly divisible by N (i.e. addr % N != 0). For example, reading 4 bytes of data from address 0x10004 is fine, but reading 4 bytes of data from address 0x10005 would be an unaligned memory access.
A memory access is said to be aligned when the data being accessed is n bytes long and the datum address is n-byte aligned. When a memory access is not aligned, it is said to be misaligned. Note that by definition byte memory accesses are always aligned.
It depends on the instruction(s), for most x86 SSE load/store instructions (excluding unaligned variants), it will cause a fault, which means it'll probably crash your program or lead to lots of round trips to your exception handler (which means almost or all performance is lost). The unaligned load/store variants run at double the amount of cycles IIRC, as they perform partial read/writes, so 2 are required to perform the operation (unless you are lucky and its in cache, which greatly reduces the penalty).
For general x86 load/store instructions, the penalty is speed, as more cycles are required to do the read or write. unalignment may also affect caching, leading to cache line splitting, and cache boundary straddling. It also prevents atomicity on reads and writes (which are guaranteed for all aligned read/writes of x86, barriers and propagation is something else, but using LOCK'ed instruction on unaligned data may cause and exception or greatly increase the already massive penalty the bu lock incurs), which is a no-no for concurrent programming.
Intels x86 & x64 optimizations manual goes into great detail about each aforementioned problem, their side-effects and how to remedy them.
Agner Fog' optimization manuals should have the exact numbers you are looking for in terms of raw cycle throughput.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With