GCC is giving me a hard time generating optimal assembly for following source code:
memset(X, 0, 16);
for (int i= 0; i < 16; ++i) {
X[0] ^= table[i][Y[i]].asQWord;
}
X
being an uint64_t[2]
array, andY
being an unsigned char[16]
array, andtable
being a double dimensional array of union qword_t
:
union qword_t {
uint8_t asBytes[8];
uint64_t asQWord;
};
const union qword_t table[16][256] = /* ... */;
With options -m64 -Ofast -mno-sse
it does unroll the loop, and each xor with assignment results in 3 instructions (thus overall number of instructions issued is 3 * 16 = 48):
movzx r9d, byte ptr [Y + i] ; extracting byte
xor rax, qword ptr [table + r9*8 + SHIFT] ; xoring, SHIFT = i * 0x800
mov qword ptr [X], rax ; storing result
Now, my understanding is that resulting X value could be accumulated in rax
register throughout all 16 xors, and then it could be stored at [X]
address, which could be achieved with these two instructions for each xor with assignment:
movzx r9d, byte ptr [Y + i] ; extracting byte
xor rax, qword ptr [table + r9*8 + SHIFT] ; xoring, SHIFT = i * 0x800
and single storing:
mov qword ptr [X], rax ; storing result
(In this case overall number of instructions is 2 * 16 + 1 = 33)
Why does GCC generate these redundant mov
instructions? What can I do to avoid this?
P.S. C99, GCC 5.3.0, Intel Core i5 Sandy Bridge
Redundant stores are usually down to aliasing; in this case gcc would be unable to prove to its satisfaction that the store to X[0]
does not affect table
. It makes a big difference how the variables are passed to the routine; if they are globals or members of the same larger struct then proving non-aliasing is easier.
Example:
void f1(uint64_t X[2]) {
memset(X, 0, 16);
for (int i= 0; i < 16; ++i) {
X[0] ^= table[i][Y[i]].asQWord;
}
}
uint64_t X[2];
void f2() {
memset(X, 0, 16);
for (int i= 0; i < 16; ++i) {
X[0] ^= table[i][Y[i]].asQWord;
}
}
Here the store to X[0]
is sunk out of the loop in f2
but not in f1
, because only in f2
can gcc prove that X
does not alias members of table
.
Your workaround/fix could be to adjust how the parameters are passed, to use the restrict
specifier, or to manually sink the store yourself.
To avoid this, you could use this instead:
uint64_t v = 0;
for (int i= 0; i < 16; ++i) {
v ^= table[i][Y[i]].asQWord;
}
X[0] = v;
X[1] = 0;
You can easily notice the generated instructions are sub-optimal in your case, however for different reasons gcc may not be able to determine that. (And in this case, gcc cannot determine that table will never access the same memory-region as X, as ecatmur explains more elaborately.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With