First, to make the C++ code more readable: I am writing a compiler, and I gave it:
var swap = ( int x, y ) => { //Assign method that returns two ints, and gets two ints as parameter to variable named swap.
var NewX = y
var NewY = x
}
var increment = ( int x ) => {
var Result = x + 1
}
NOTE: Functions return any variable whose first letter is capitalized. swap can be used like ... = swap( x, y ).NewX, but increment can be used as just ... = increment( x ).
After some optimization it generated the following (it made swap and increment actual functions instead of variables, and optimized swap's stack):
template<int BytesCount> struct rawdata { //struct from some header
char _[ BytesCount ];
inline char &operator[] (int index) {
return _[ index ];
}
};
//...
rawdata<8> generatedfunction0( rawdata<8> p ) { // var swap = ( int x, y ) => {
return{ p[ 4 ], p[ 5 ], p[ 6 ], p[ 7 ], p[ 0 ], p[ 1 ], p[ 2 ], p[ 3 ] };
}
rawdata<4> generatedfunction1( rawdata<4> p ) { // var increment = ( int x ) => {
rawdata<4> r = { p[ 0 ], p[ 1 ], p[ 2 ], p[ 3 ] };
++*( ( int* )&r[ 0 ] );
return r;
}
I am almost sure that ++*( ( int* )&r[ 0 ] ); won't introduce a useless indirection, but what about return{ p[ 4 ], p[ 5 ], p[ 6 ], p[ 7 ], p[ 0 ], p[ 1 ], p[ 2 ], p[ 3 ] };? Is there any source that guarantees it will be optimized as if it were two ints being placed into the array, rather than 8 or more instructions copying it byte by byte? I am not asking about this particular case only, but about anything similar.
If it depends on the compiler: I am using GCC to compile the generated code.
Yes, it can harm performance, but not always. The problem is the explicit access to the individual bytes.
An "intelligent" compiler would recognise that you access contiguous memory and try to optimize it. However, for some reason that doesn't work at all with gcc, clang or icc (I haven't tested msvc). There is still room for improvement in compiler optimizers, and IIRC the standard does not mandate any optimization.
Swap:
So, let's go through each function, starting with swap. I added 2 more functions for completeness; see the explanations after the code snippet:
#include <stdint.h>
rawdata<8> genSWAP(rawdata<8> p)
{
return { p[ 4 ], p[ 5 ], p[ 6 ], p[ 7 ], p[ 0 ], p[ 1 ], p[ 2 ], p[ 3 ] };
}
rawdata<8> genSWAPvar(rawdata<8> p)
{
return { p._[ 4 ], p._[ 5 ], p._[ 6 ], p._[ 7 ], p._[ 0 ], p._[ 1 ], p._[ 2 ], p._[ 3 ] };
}
rawdata<8> genSWAP32(rawdata<8> p)
{
rawdata<8> res = p;
uint32_t* a = (uint32_t*)&res[0];
uint32_t* b = (uint32_t*)&res[4];
uint32_t tmp = *a;
*a = *b;
*b = tmp;
return res;
}
- genSWAP: your function
- genSWAPvar: the same as yours, but without using the operator[] you defined
- genSWAP32: explicitly packing your bytes 32 bits at a time

You can view the generated asm here.
genSWAP and genSWAPvar are no different, meaning that the overloaded operator[] is simply optimized out. However, each byte is loaded from memory individually and also processed individually. This is bad, since on 32-bit architectures the processor loads 4 bytes at once from memory (8 on 64-bit architectures). So in short, gcc/clang/icc are emitting instructions that work against the real capabilities of 32-bit architectures.
genSWAP32 is far more efficient, doing the minimum number of loads (for 32 bits) and using the registers correctly (note that on 64-bit architectures it should be possible to do only one load instead of 2).
And finally, some real measurements: on Ideone, genSWAP32 is almost 4x faster (which makes sense, because it does 2 loads instead of 8 and has fewer computing instructions).
Increment:
Same approach here: your function vs an "optimized" one:
rawdata<4> genINC(rawdata<4> p)
{
rawdata<4> r = { p[ 0 ], p[ 1 ], p[ 2 ], p[ 3 ] };
++*( ( int* )&r[ 0 ] );
return r;
}
rawdata<4> genINC32(rawdata<4> p)
{
rawdata<4> res = p;
uint32_t* a = (uint32_t*)&res[0];
++*a;
return res;
}
The generated asm is here.
For clang and icc, the killer is not the increment but the initialization, where you access each byte individually. gcc and icc probably do this by default because the order of the bytes could be different from 0 1 2 3. Surprisingly, clang recognizes that the byte order is unchanged and optimizes this correctly: no perf difference.
Then something interesting happens: the genINC32 function is slower on gcc, but faster on msvc (I don't see a permalink button on rise4fun, so go there and paste the code that was tested on ideone). Without seeing the msvc-generated assembler and comparing, I have no explanation for that.
In conclusion: while it is possible for a compiler to optimize all of this correctly, don't rely on it for now, and don't access every byte individually if you don't need to.