I have performance-critical code with a huge function that allocates around 40 arrays of different sizes on the stack at the beginning of the function. Most of these arrays must have a certain alignment, because they are accessed further down the chain using CPU instructions that require memory alignment (on Intel and ARM CPUs).
Since some versions of GCC simply fail to align stack variables properly (notably for ARM code), or sometimes even report that the maximum alignment for the target architecture is less than what my code requests, I have no choice but to allocate these arrays on the stack and align them manually.
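(For reference, the attribute-based form that those toolchains mishandle would look something like this; the exact spelling is illustrative, not code from my function:)
__attribute__((aligned(32))) short history[HIST_SIZE];  /* ignored or rejected for stack variables by some GCC/ARM combinations */
/* C11 spelling: _Alignas(32) short history[HIST_SIZE]; */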
So, for each array I need to do something like this to get it aligned properly:
short history_[HIST_SIZE + 32];
short * history = (short*)((((uintptr_t)history_) + 31) & (~31));
This way, history is now aligned on a 32-byte boundary. Doing this for all 40 arrays is tedious, and this part of the code is really CPU intensive, so I simply cannot apply the same alignment technique to each array individually (the alignment mess confuses the optimizer, and the resulting register allocation slows the function down big time; for a better explanation see the end of the question).
So... obviously, I want to do that manual alignment only once and rely on these arrays being located one right after the other. I also added extra padding to these arrays so that each is always a multiple of 32 bytes. Then I simply create a jumbo char array on the stack and cast it to a struct that contains all these aligned arrays:
struct tmp
{
short history[HIST_SIZE];
short history2[2*HIST_SIZE];
...
int energy[320];
...
};
char buf[sizeof(tmp) + 32];
tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));
Something like that. Maybe not the most elegant, but it produced a really good result, and manual inspection of the generated assembly confirms that the generated code is more or less adequate and acceptable. Then the build system was updated to a newer GCC, and we suddenly started to see artifacts in the generated data (e.g., the output of the validation test suite is no longer bit-exact, even in a pure C build with the asm code disabled). It took a long time to debug, and the issue turned out to be related to aliasing rules and newer versions of GCC.
So, how can I get this done? Please don't waste time explaining that it's not standard, not portable, undefined, etc. (I've read many articles about that). Also, there is no way I can change the code (I would perhaps consider patching GCC to fix the issue, but not refactoring the code). Basically, all I want is some black-magic spell so that newer GCC produces functionally the same code for this kind of construct without disabling optimizations.
Edit:
In short, the point of the question: how can I allocate an arbitrary amount of stack space (using a char array or alloca), then align a pointer within that stack space and reinterpret that chunk of memory as a structure with a well-defined layout that guarantees alignment of certain members, as long as the structure itself is aligned properly? I've tried casting the memory using all kinds of approaches, and I've moved the big stack allocation into a separate function, yet I still get bad output and stack corruption; I'm really starting to think that this huge function hits some kind of bug in GCC. It's quite strange that I can't get this cast to work no matter what I try. By the way, I disabled all optimizations that require any alignment, so it's pure C-style code now, and I still get bad results (non-bit-exact output and occasional stack-corruption crashes). The simple fix that cures it all: instead of
char buf[sizeof(tmp) + 32];
tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));
I write this:
tmp buf;
tmp * X = &buf;
then all bugs disappear! The only problem is that this code doesn't do proper alignment for the arrays and will crash with optimizations enabled.
Interesting observation:
I mentioned that this approach works well and produces the expected output:
tmp buf;
tmp * X = &buf;
In some other file I added a standalone noinline function that simply casts a void pointer to that struct tmp*:
struct tmp * to_struct_tmp(void * buffer32)
{
return (struct tmp *)buffer32;
}
Initially, I thought that casting the allocated memory through to_struct_tmp would trick GCC into producing the results I expected, yet it still produces invalid output. If I modify the working code this way:
tmp buf;
tmp * X = to_struct_tmp(&buf);
then I get the same bad result! Wow, what else can I say? Perhaps, based on the strict aliasing rule, GCC assumes that tmp * X isn't related to tmp buf and removes tmp buf as an unused variable right after the return from to_struct_tmp? Or it does something else strange that produces the unexpected result. I also tried to inspect the generated assembly; however, changing tmp * X = &buf; to tmp * X = to_struct_tmp(&buf); produces extremely different code for the function, so somehow that aliasing rule affects code generation big time.
Conclusion:
After all kinds of testing, I have an idea why I can't get this to work no matter what I try. Because of strict type aliasing, GCC decides that the stack array is unused and therefore doesn't allocate stack space for it. Then other local variables that also live on the stack are written to the same location where my tmp struct is stored; in other words, my jumbo struct shares its stack memory with other variables of the function. Only this could explain why it always produces the same bad result. -fno-strict-aliasing fixes the issue, as expected in this case.
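A sketch of how that fix might be scoped to just this one function instead of the whole build, assuming GCC (the optimize attribute is documented by GCC but is best-effort, so a per-file -fno-strict-aliasing in the build system is the more conventional route; struct tmp below is a placeholder for the real one):
#include <stdint.h>

/* Placeholder stand-in for the real struct tmp from the question. */
struct tmp { short history[512]; int energy[320]; };

/* Disable type-based alias analysis for this one function only. */
__attribute__((optimize("no-strict-aliasing")))
void huge_function(void)
{
    char buf[sizeof(struct tmp) + 32];
    struct tmp *X = (struct tmp *)(((uintptr_t)buf + 31) & ~(uintptr_t)31);

    X->history[0] = 0;   /* ... the real work on X goes here ... */
    X->energy[0] = 0;
}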
First, I'd like to say I'm definitely with you when you ask people not to go on about "standard violation", "implementation-dependent", etc. Your question is absolutely legitimate, IMHO.
Your approach to pack all the arrays within one struct also makes sense; that's what I'd do.
It's unclear from the question formulation which "artifacts" you observe. Is there unneeded code generated? Or data misalignment? If the latter is the case, you may (hopefully) use things like STATIC_ASSERT to ensure at compile time that things are aligned properly, or at least have some run-time ASSERT in the debug build.
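One possible shape for such checks, assuming C11 (_Static_assert; older compilers would need a macro-based STATIC_ASSERT), assuming struct tmp is visible, and using two member names (history2, energy) from the question's struct:
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compile-time checks: the struct stays a multiple of 32 bytes and each array
   that needs alignment sits on a 32-byte offset within it. */
_Static_assert(sizeof(struct tmp) % 32 == 0, "struct tmp must be a multiple of 32 bytes");
_Static_assert(offsetof(struct tmp, history2) % 32 == 0, "history2 misaligned within struct tmp");
_Static_assert(offsetof(struct tmp, energy) % 32 == 0, "energy misaligned within struct tmp");

/* Run-time check for the debug build: the manually aligned base pointer
   really is on a 32-byte boundary. */
static void check_alignment(const struct tmp *X)
{
    assert(((uintptr_t)X & 31) == 0);
}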
As Eric Postpischil proposed, you may consider declaring this structure as a global (if that is applicable to your case, i.e., if multi-threading and recursion are not involved).
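A minimal sketch of that idea, assuming GCC's aligned attribute (which GCC generally honors for objects with static storage duration, unlike over-aligned stack variables on some targets), assuming struct tmp from the question is defined, and a single-threaded, non-recursive caller; huge_function is an illustrative name:
/* One 32-byte aligned workspace with static storage duration; the hot
   function takes its address instead of carving space out of its stack frame. */
static struct tmp g_workspace __attribute__((aligned(32)));

void huge_function(void)
{
    struct tmp *X = &g_workspace;
    /* ... use X->history, X->history2, X->energy as before ... */
    (void)X;
}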
One more point I'd like to mention is so-called stack probes. When you allocate a lot of memory on the stack in a single function (more than one page, to be exact), on some platforms (such as Win32) the compiler adds extra initialization code, known as stack probes. This may also have some performance impact (though it is likely to be minor).
Also, if you don't need all 40 arrays simultaneously, you may arrange some of them in a union. That is, you'll have one big struct, inside which some sub-structs are grouped into unions.
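A rough sketch of that layout (sizes and member names are illustrative, not taken from the question):
#define HIST_SIZE 512   /* illustrative value */

struct tmp {
    short history[HIST_SIZE];
    short history2[2 * HIST_SIZE];
    /* Arrays that are never live at the same time share storage: */
    union {
        struct { int   fft_scratch[320];    } analysis;   /* used only in pass 1 */
        struct { short resample_buf[1280];  } synthesis;  /* used only in pass 2 */
    } scratch;
    int energy[320];
};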
There are a number of issues here.
Alignment: There is little that requires 32-byte alignment. 16-byte alignment is beneficial for SIMD types on current Intel and ARM processors. With AVX on current Intel processors, the performance cost of using addresses that are 16-byte aligned but not 32-byte aligned is generally mild. There may be a large penalty for 32-byte stores that cross a cache line, so 32-byte alignment can be helpful there. Otherwise, 16-byte alignment may be fine. (On OS X and iOS, malloc returns 16-byte aligned memory.)
Allocation in critical code: You should avoid allocating memory in performance critical code. Generally, memory should be allocated at the start of the program, or before performance critical work begins, and reused during the performance critical code. If you allocate memory before performance critical code begins, then the time it takes to allocate and prepare the memory is essentially irrelevant.
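A minimal sketch of the allocate-once-and-reuse pattern, assuming C11 aligned_alloc (posix_memalign would work as well) and the struct tmp from the question; the workspace_* names are illustrative:
#include <stdlib.h>

/* Allocate the whole workspace once, 32-byte aligned, before the hot path
   starts, and reuse it on every call. aligned_alloc requires the size to be
   a multiple of the alignment; the question already pads struct tmp to a
   multiple of 32 bytes. */
static struct tmp *workspace;

int workspace_init(void)
{
    workspace = aligned_alloc(32, sizeof(struct tmp));
    return workspace != NULL;
}

void workspace_destroy(void)
{
    free(workspace);
    workspace = NULL;
}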
Large, numerous arrays on the stack: The stack is not intended for large memory allocations, and there are limits to its use. Even if you are not encountering problems now, apparently unrelated changes in your code in the future could interact with using a lot of memory on the stack and cause stack overflows.
Numerous arrays: 40 arrays is a lot. Unless these are all in use for different data at the same time, and necessarily so, you should seek to reuse some of the same space for different data and purposes. Using different arrays unnecessarily can cause more cache thrashing than necessary.
Optimization: It is not clear what you mean by saying that the “alignment mess confuses the optimizer and different register allocation slows down the function big time”. If you have multiple automatic arrays inside a function, I would generally expect the optimizer to know they are different, even if you derive pointers from the arrays by address arithmetic. E.g., given code such as a[i] = 3; b[i] = c[i]; a[i] = 4;, I would expect the optimizer to know that a, b, and c are different arrays, and therefore c[i] cannot be the same as a[i], so it is okay to eliminate a[i] = 3;. Perhaps an issue you have is that, with 40 arrays, you have 40 pointers to arrays, so the compiler ends up moving pointers into and out of registers?
In that case, reusing fewer arrays for multiple purposes might help reduce that. If you have an algorithm that is actually using 40 arrays at one time, then you might look at restructuring the algorithm so it uses fewer arrays at a time. If an algorithm has to point to 40 different places in memory, then you essentially need 40 pointers, regardless of where or how they are allocated, and 40 pointers is more than there are registers available.
If you have other concerns about optimization and register use, you should be more specific about them.
Aliasing and artifacts: You report there are some aliasing and artifact problems, but you do not give sufficient details to understand them. If you have one large char array that you reinterpret as a struct containing all your arrays, then there is no aliasing within the struct. So it is not clear what issues you are encountering.