I'm using MinGW64 build based on GCC 4.6.1 for Windows 64bit target. I'm playing around with the new Intel's AVX instructions. My command line arguments are -march=corei7-avx -mtune=corei7-avx -mavx
.
But I started running into segmentation fault errors when allocating local variables on the stack. GCC uses the aligned moves VMOVAPS
and VMOVAPD
to move __m256
and __m256d
around, and these instructions require 32-byte alignment. However, the stack for Windows 64bit has only 16 byte alignment.
How can I change the GCC's stack alignment to 32 bytes?
I have tried using -mstackrealign
but to no avail, since that aligns only to 16 bytes. I couldn't make __attribute__((force_align_arg_pointer))
work either, it aligns to 16 bytes anyway. I haven't been able to find any other compiler options that would address this. Any help is greatly appreciated.
EDIT:
I tried using -mpreferred-stack-boundary=5
, but GCC says that 5 is not supported for this target. I'm out of ideas.
The GNU documentation states that malloc is aligned to 16 byte multiples on 64 bit systems.
A 1-byte variable (typically a char in C/C++) is always aligned. A 2-byte variable (typically a short in C/C++) in order to be aligned must lie at an address divisible by 2. A 4-byte variable (typically an int in C/C++) must lie at an address divisible by 4 and so on.
IIRC, stack alignment is when variables are placed on the stack "aligned" to a particular number of bytes. So if you are using a 16 bit stack alignment, each variable on the stack is going to start from a byte that is a multiple of 2 bytes from the current stack pointer within a function.
Data alignment means that the address of a data can be evenly divisible by 1, 2, 4, or 8. In other words, data object can have 1-byte, 2-byte, 4-byte, 8-byte alignment or any power of 2.
I have been exploring the issue, filed a GCC bug report, and found out that this is a MinGW64 related problem. See GCC Bug#49001. Apparently, GCC doesn't support 32-byte stack alignment on Windows. This effectively prevents the use of 256-bit AVX instructions.
I investigated a couple ways how to deal with this issue. The simplest and bluntest solution is to replace of aligned memory accesses VMOVAPS/PD/DQA by unaligned alternatives VMOVUPS etc. So I learned Python last night (very nice tool, by the way) and pulled off the following script that does the job with an input assembler file produced by GCC:
import re
import fileinput
import sys
# fix aligned stack access
# replace aligned vmov* by unaligned vmov* with 32-byte aligned operands
# see Intel's AVX programming guide, page 39
vmova = re.compile(r"\s*?vmov(\w+).*?((\(%r.*?%ymm)|(%ymm.*?\(%r))")
aligndict = {"aps" : "ups", "apd" : "upd", "dqa" : "dqu"};
for line in fileinput.FileInput(sys.argv[1:],inplace=1):
m = vmova.match(line)
if m and m.group(1) in aligndict:
s = m.group(1)
print line.replace("vmov"+s, "vmov"+aligndict[s]),
else:
print line,
This approach is pretty safe and foolproof. Though I observed a performance penalty on rare occasions. When the stack is unaligned, the memory access crosses the cache line boundary. Fortunately, the code performs as fast as aligned accesses most of the time. My recommendation: inline functions in critical loops!
I also attempted to fix the stack allocation in every function prolog using another Python script, trying to align it always at the 32-byte boundary. This seems to work for some code, but not for other. I have to rely on the good will of GCC that it will allocate aligned local variables (with respect to the stack pointer), which it usually does. This is not always the case, especially when there is a serious register spilling due to the necessity to save all ymm register before a function call. (All ymm registers are callee-save). I can post the script if there's an interest.
The best solution would be to fix GCC MinGW64 build. Unfortunately, I have no knowledge of its internal workings, just started using it last week.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With