
Strange behavior with optimizations enabled

Tags: c, gcc, debugging

I have this small snippet of code (this is a minimal working example of the problem I have):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void xorBuffer(unsigned char* dst, unsigned char* src, int len)
{
    while (len != 0)
    {
        *dst ^= *src;
        dst++;
        src++;
        len--;
    }
}

int main()
{
    unsigned char* a = malloc(32);
    unsigned char* b = malloc(32);
    int t;

    memset(a, 0xAA, 32);
    memset(b, 0xBB, 32);

    xorBuffer(a, b, 32);

    printf("result = ");
    for (t = 0; t < 32; t++) printf("%.2x", a[t]);
    printf("\n");

    return 0;
}

This code is supposed to perform the exclusive-or of two 32-byte memory buffers (conceptually, this should do a = a ^ b). Since 0xAA ^ 0xBB = 0x11, it should print "11" thirty-two times.

My problem is that when I compile this under MinGW-GCC (Windows), it works perfectly in debug mode (no optimizations) but crashes with a SIGILL midway through the xorBuffer loop once optimizations are raised to -O3. Also, if I put a printf in the offending loop, it works perfectly again. I suspect stack corruption, but I just don't see what I'm doing wrong here.

Trying to debug with GDB with optimizations enabled is a lost cause as all GDB shows me is "variable optimized out" for every variable (and, of course, if I try and printf a variable out, it'll suddenly work).

Does anybody know what the heck is going on here? I have spent too long dwelling on this issue, and I really need to fix it properly to move on. My guess is I am missing some fundamental C pointer knowledge, but to me the code looks correct. It could be from the buffer incrementation, but as far as I know, sizeof(unsigned char) == 1, so it should be going through each byte one by one.

For what it's worth, the code works even with optimizations on GCC on my Linux box.

So... what's the deal here? Thanks!

As requested, the assembly output of the whole program:

With -O2: clicky

With -O3: clicky

I observe this behavior on GCC 4.6.2 (running with MinGW)

Thomas Avatar asked Aug 22 '12 09:08



2 Answers

From my comment:

Make sure the compiler has the right information about the target architecture. Judging from the -O3 output, the compiler is vectorizing the loop: it makes the code more parallel by using SIMD instructions (such as movdqa). If the target processor does not match exactly what the compiler is emitting code for, you can end up with illegal instructions.
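
One way to see what the processor itself reports is to ask it at run time. The following is only a sketch, assuming an x86 target and a compiler that provides the `__builtin_cpu_init` and `__builtin_cpu_supports` builtins (GCC gained them in 4.8, which is newer than the 4.6.2 in the question; Clang provides them as well). The helper names `cpu_supports_avx`, `cpu_supports_avx2` and `report_cpu_features` are made up for this illustration.

```c
#include <stdio.h>

/* Assumption: the CPU-detection builtins exist on x86 with GCC >= 4.8 or Clang. */
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__clang__) || \
     (defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8))))
#define HAVE_CPU_BUILTINS 1
#else
#define HAVE_CPU_BUILTINS 0
#endif

/* Returns 1 if the running CPU reports AVX, 0 if it does not,
 * and -1 if this compiler/target cannot answer the question. */
int cpu_supports_avx(void)
{
#if HAVE_CPU_BUILTINS
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") ? 1 : 0;
#else
    return -1;
#endif
}

/* Same convention as above, but for AVX2. */
int cpu_supports_avx2(void)
{
#if HAVE_CPU_BUILTINS
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return -1;
#endif
}

/* Prints what was detected; -1 means "unknown on this build". */
void report_cpu_features(void)
{
    printf("avx:  %d\n", cpu_supports_avx());
    printf("avx2: %d\n", cpu_supports_avx2());
}
```

Calling `report_cpu_features()` from `main` on the affected machine and comparing the result with the instructions in the -O3 assembly would show whether the emitted code and the CPU actually disagree.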

unwind Avatar answered Nov 06 '22 09:11



I am adding this as an extension of Unwind's answer (which I accept as it got me on the right track).

After sifting through the optimized code, I noticed AVX instructions. At first, I thought that shouldn't cause an issue, considering my processor supports the AVX instruction set. However, it turns out that there are two distinct AVX versions: AVX1 and AVX2. And while my processor only supports AVX1, gcc indiscriminately uses AVX2 opcodes as long as the processor supports either of the two versions (llvm made the same mistake; there are bug reports on that). This is, as far as I can tell, incorrect behavior and a compiler bug.

The result is AVX2 code on an AVX1 system, which obviously leads to an illegal instruction. It explains many things, from the code not failing on inputs smaller than 32 bytes (because of the 256-bit register width), to the code working on my Linux box, which happens to be a virtual machine with CPU support limited to SSE3.

The fix is either to drop -O3 and go back to -O2, where gcc won't resort to the most hardcore SIMD instructions to optimize simple code, or to use the volatile keyword, which forces it to go through the buffers byte by byte, painstakingly, like so:

*(unsigned char volatile *)dst ^= *(unsigned char volatile *)src;

This is of course very slow and probably worse than just using -O2 (ignoring whole-program repercussions), but it can be worked around by going through the buffer int by int instead and padding at the end, which is good enough in terms of speed.
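
The int-by-int idea can be sketched roughly as follows. This is an illustration, not the code I actually ended up with: the name `xorBufferWords` is made up here, it handles the leftover bytes one by one rather than padding the buffer, and it uses memcpy for the word accesses so it does not depend on alignment. Note that this sketch does not use volatile, so the compiler remains free to vectorize it; it only shows the word-at-a-time traversal.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* XOR src into dst one unsigned int at a time, then finish the
 * remaining bytes (fewer than sizeof(unsigned int)) one by one. */
void xorBufferWords(unsigned char *dst, const unsigned char *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    /* Word-sized chunks: copy out, XOR, copy back. */
    while (len >= sizeof(unsigned int)) {
        unsigned int d, s;
        memcpy(&d, dst, sizeof d);
        memcpy(&s, src, sizeof s);
        d ^= s;
        memcpy(dst, &d, sizeof d);
        dst += sizeof d;
        src += sizeof s;
        len -= sizeof d;
    }

    /* Tail: whatever did not fill a whole word. */
    while (len != 0) {
        *dst ^= *src;
        dst++;
        src++;
        len--;
    }
}
```

In the program from the question, the call would become `xorBufferWords(a, b, 32);` and the expected output is unchanged.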

Another good fix is to upgrade to a version of gcc which does not have this bug (this version may not exist yet, I have not checked).

EDIT: the ultimate fix is to throw the -mno-avx flag at GCC, thereby disabling any and all AVX opcodes. This completely negates the bug with no code modifications, and the flag can easily be removed once a patched compiler version is available.
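
To confirm the flag took effect, one can check the macros GCC predefines for the instruction sets it has been told to target (`__AVX__` and `__AVX2__`). A minimal sketch follows; the file name in the comment and the helper names `built_with_avx`, `built_with_avx2` and `report_target` are assumptions made for this example.

```c
/* Hypothetical build commands (the file name is an assumption):
 *   gcc -O3 -mno-avx target_check.c -o target_check
 *   gcc -O3 -mavx    target_check.c -o target_check
 */
#include <stdio.h>

/* 1 if the compiler was allowed to emit AVX instructions, else 0. */
int built_with_avx(void)
{
#ifdef __AVX__
    return 1;
#else
    return 0;
#endif
}

/* 1 if the compiler was allowed to emit AVX2 instructions, else 0. */
int built_with_avx2(void)
{
#ifdef __AVX2__
    return 1;
#else
    return 0;
#endif
}

/* Prints the compile-time target settings seen by this translation unit. */
void report_target(void)
{
    printf("compiled with AVX:  %d\n", built_with_avx());
    printf("compiled with AVX2: %d\n", built_with_avx2());
}
```

When built with -mno-avx, `built_with_avx()` should return 0. This reflects only what the compiler was permitted to emit, not what the CPU supports.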

What a perverse compiler bug.

Thomas Avatar answered Nov 06 '22 08:11
