Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to align stack at 32 byte boundary in GCC?

Tags:

stack

gcc

avx

sse

I'm using MinGW64 build based on GCC 4.6.1 for Windows 64bit target. I'm playing around with the new Intel's AVX instructions. My command line arguments are -march=corei7-avx -mtune=corei7-avx -mavx.

But I started running into segmentation fault errors when allocating local variables on the stack. GCC uses the aligned moves VMOVAPS and VMOVAPD to move __m256 and __m256d around, and these instructions require 32-byte alignment. However, the stack for Windows 64bit has only 16 byte alignment.

How can I change the GCC's stack alignment to 32 bytes?

I have tried using -mstackrealign but to no avail, since that aligns only to 16 bytes. I couldn't make __attribute__((force_align_arg_pointer)) work either, it aligns to 16 bytes anyway. I haven't been able to find any other compiler options that would address this. Any help is greatly appreciated.

EDIT: I tried using -mpreferred-stack-boundary=5, but GCC says that 5 is not supported for this target. I'm out of ideas.

like image 989
Norbert P. Avatar asked May 12 '11 19:05

Norbert P.


People also ask

Is malloc 16 byte aligned?

The GNU documentation states that malloc is aligned to 16 byte multiples on 64 bit systems.

What does 4-byte aligned mean?

A 1-byte variable (typically a char in C/C++) is always aligned. A 2-byte variable (typically a short in C/C++) in order to be aligned must lie at an address divisible by 2. A 4-byte variable (typically an int in C/C++) must lie at an address divisible by 4 and so on.

What is stack alignment?

IIRC, stack alignment is when variables are placed on the stack "aligned" to a particular number of bytes. So if you are using a 16 bit stack alignment, each variable on the stack is going to start from a byte that is a multiple of 2 bytes from the current stack pointer within a function.

What is aligned data in bit?

Data alignment means that the address of a data can be evenly divisible by 1, 2, 4, or 8. In other words, data object can have 1-byte, 2-byte, 4-byte, 8-byte alignment or any power of 2.


1 Answers

I have been exploring the issue, filed a GCC bug report, and found out that this is a MinGW64 related problem. See GCC Bug#49001. Apparently, GCC doesn't support 32-byte stack alignment on Windows. This effectively prevents the use of 256-bit AVX instructions.

I investigated a couple ways how to deal with this issue. The simplest and bluntest solution is to replace of aligned memory accesses VMOVAPS/PD/DQA by unaligned alternatives VMOVUPS etc. So I learned Python last night (very nice tool, by the way) and pulled off the following script that does the job with an input assembler file produced by GCC:

import re
import fileinput
import sys

# fix aligned stack access
# replace aligned vmov* by unaligned vmov* with 32-byte aligned operands 
# see Intel's AVX programming guide, page 39
vmova = re.compile(r"\s*?vmov(\w+).*?((\(%r.*?%ymm)|(%ymm.*?\(%r))")
aligndict = {"aps" : "ups", "apd" : "upd", "dqa" : "dqu"};
for line in fileinput.FileInput(sys.argv[1:],inplace=1):
    m = vmova.match(line)
    if m and m.group(1) in aligndict:
        s = m.group(1)
        print line.replace("vmov"+s, "vmov"+aligndict[s]),
    else:
        print line,

This approach is pretty safe and foolproof. Though I observed a performance penalty on rare occasions. When the stack is unaligned, the memory access crosses the cache line boundary. Fortunately, the code performs as fast as aligned accesses most of the time. My recommendation: inline functions in critical loops!

I also attempted to fix the stack allocation in every function prolog using another Python script, trying to align it always at the 32-byte boundary. This seems to work for some code, but not for other. I have to rely on the good will of GCC that it will allocate aligned local variables (with respect to the stack pointer), which it usually does. This is not always the case, especially when there is a serious register spilling due to the necessity to save all ymm register before a function call. (All ymm registers are callee-save). I can post the script if there's an interest.

The best solution would be to fix GCC MinGW64 build. Unfortunately, I have no knowledge of its internal workings, just started using it last week.

like image 85
Norbert P. Avatar answered Oct 15 '22 22:10

Norbert P.