Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

GCC generates redundant code for repeated XOR of an array element

GCC is giving me a hard time generating optimal assembly for following source code:

memset(X, 0, 16);
for (int i= 0; i < 16; ++i) {
    X[0] ^= table[i][Y[i]].asQWord;
}

X being an uint64_t[2] array, and
Y being an unsigned char[16] array, and
table being a double dimensional array of union qword_t:

union qword_t {
    uint8_t asBytes[8];
    uint64_t asQWord;
};

const union qword_t table[16][256] = /* ... */;

With options -m64 -Ofast -mno-sse it does unroll the loop, and each xor with assignment results in 3 instructions (thus overall number of instructions issued is 3 * 16 = 48):

movzx  r9d, byte ptr [Y + i]                   ; extracting byte
xor    rax, qword ptr [table + r9*8 + SHIFT]   ; xoring, SHIFT = i * 0x800
mov    qword ptr [X], rax                      ; storing result

Now, my understanding is that resulting X value could be accumulated in rax register throughout all 16 xors, and then it could be stored at [X] address, which could be achieved with these two instructions for each xor with assignment:

movzx  r9d, byte ptr [Y + i]                   ; extracting byte
xor    rax, qword ptr [table + r9*8 + SHIFT]   ; xoring, SHIFT = i * 0x800

and single storing:

mov    qword ptr [X], rax                      ; storing result

(In this case overall number of instructions is 2 * 16 + 1 = 33)

Why does GCC generate these redundant mov instructions? What can I do to avoid this?

P.S. C99, GCC 5.3.0, Intel Core i5 Sandy Bridge

like image 467
aprelev Avatar asked Apr 13 '16 08:04

aprelev


2 Answers

Redundant stores are usually down to aliasing; in this case gcc would be unable to prove to its satisfaction that the store to X[0] does not affect table. It makes a big difference how the variables are passed to the routine; if they are globals or members of the same larger struct then proving non-aliasing is easier.

Example:

void f1(uint64_t X[2]) {
  memset(X, 0, 16);
  for (int i= 0; i < 16; ++i) {
    X[0] ^= table[i][Y[i]].asQWord;
  }
}

uint64_t X[2];
void f2() {
  memset(X, 0, 16);
  for (int i= 0; i < 16; ++i) {
    X[0] ^= table[i][Y[i]].asQWord;
  }
}

Here the store to X[0] is sunk out of the loop in f2 but not in f1, because only in f2 can gcc prove that X does not alias members of table.

Your workaround/fix could be to adjust how the parameters are passed, to use the restrict specifier, or to manually sink the store yourself.

like image 103
ecatmur Avatar answered Nov 01 '22 05:11

ecatmur


To avoid this, you could use this instead:

uint64_t v = 0;
for (int i= 0; i < 16; ++i) {
    v ^= table[i][Y[i]].asQWord;
}
X[0] = v;
X[1] = 0;

You can easily notice the generated instructions are sub-optimal in your case, however for different reasons gcc may not be able to determine that. (And in this case, gcc cannot determine that table will never access the same memory-region as X, as ecatmur explains more elaborately.)

like image 24
Elijan9 Avatar answered Nov 01 '22 05:11

Elijan9