Can using char[] as parameters, return values, etc. cause any performance issues?

Tags:

c++

gcc

First, some context to make the C++ code more readable: I am writing a compiler, and I gave it this input:

var swap = ( int x, y ) => { // Assign to a variable named swap a function that takes two ints as parameters and returns two ints.
    var NewX = y
    var NewY = x
}
var increment = ( int x ) => {
    var Result = x + 1
}

NOTE: Functions return every variable whose first letter is capitalized. swap can be used like ... = swap( x, y ).NewX, while increment can be used as just ... = increment( x ).

After some optimization it generated the following (it made swap and increment actual functions instead of variables, and optimized swap's stack):

template<int BytesCount> struct rawdata { //struct from some header
    char _[ BytesCount ];
    inline char &operator[] (int index) {
        return _[ index ];
    }
};

//...

rawdata<8> generatedfunction0( rawdata<8> p ) { // var swap = ( int x, y ) => {
    return{ p[ 4 ], p[ 5 ], p[ 6 ], p[ 7 ], p[ 0 ], p[ 1 ], p[ 2 ], p[ 3 ] };
}
rawdata<4> generatedfunction1( rawdata<4> p ) { // var increment = ( int x ) => {
    rawdata<4> r = { p[ 0 ], p[ 1 ], p[ 2 ], p[ 3 ] };
    ++*( ( int* )&r[ 0 ] );
    return r;
}

I am almost sure that ++*( ( int* )&r[ 0 ] ); won't do a useless indirection, but what about return{ p[ 4 ], p[ 5 ], p[ 6 ], p[ 7 ], p[ 0 ], p[ 1 ], p[ 2 ], p[ 3 ] };? Is there any source that guarantees it will be optimized as if two ints were being put into the array, instead of 8 or more instructions that copy byte by byte? I am not talking about this particular case only, but about anything similar.

If it depends, then I am using GCC to compile the generated code.

asked Sep 29 '22 09:09 by LyingOnTheSky

1 Answer

Yes, it can harm performance, but not always. The problem is the explicit access of the individual bytes.

An "intelligent" compiler would recognize that you access contiguous memory and optimize accordingly. However, for some reason that does not happen with gcc, clang, or icc (I haven't tested MSVC). There is still room for improvement in compiler optimizers, and IIRC the standard does not mandate any particular optimization.

Swap:

So, let's process each function, starting with swap. I added 2 more functions for completeness, see after the code snippet:

#include <stdint.h>

// rawdata<> is the template defined in the question

rawdata<8> genSWAP(rawdata<8> p)
{
    return { p[ 4 ], p[ 5 ], p[ 6 ], p[ 7 ], p[ 0 ], p[ 1 ], p[ 2 ], p[ 3 ] };
}

rawdata<8> genSWAPvar(rawdata<8> p)
{
    return { p._[ 4 ], p._[ 5 ], p._[ 6 ], p._[ 7 ], p._[ 0 ], p._[ 1 ], p._[ 2 ], p._[ 3 ] };
}

rawdata<8> genSWAP32(rawdata<8> p)
{
    rawdata<8> res = p;
    uint32_t* a = (uint32_t*)&res[0];
    uint32_t* b = (uint32_t*)&res[4];
    uint32_t tmp = *a;
    *a = *b;
    *b = tmp;
    return res;
}
  • genSWAP : your function
  • genSWAPvar : the same as yours, but without going through the operator[] you defined
  • genSWAP32 : explicitly packing your bytes, 32 bits at a time

You can view the generated asm here.

genSWAP and genSWAPvar are no different, meaning the overloaded operator[] is simply optimized out. However, each byte is loaded from memory individually and also processed individually. This is bad, since on 32-bit architectures the processor can load 4 bytes at once from memory (8 on 64-bit architectures). In short, gcc/clang/icc are emitting instructions that work against the real capabilities of 32-bit architectures...

genSWAP32 is far more efficient, doing the minimum number of loads (for 32-bit) and correctly using the registers (note that on 64-bit architectures it should be possible to do only one load instead of 2).

And finally, some real measurements: on Ideone, genSWAP32 is almost 4x faster (which makes sense, because it has 2 loads instead of 8 and fewer computing instructions).

Increment:

Same here, your function vs. an "optimized" one:

rawdata<4> genINC(rawdata<4> p)
{
    rawdata<4> r = { p[ 0 ], p[ 1 ], p[ 2 ], p[ 3 ] };
    ++*( ( int* )&r[ 0 ] );
    return r;
}

rawdata<4> genINC32(rawdata<4> p)
{
    rawdata<4> res = p;
    uint32_t* a = (uint32_t*)&res[0];
    ++*a;
    return res;
}

The generated asm is here.

For gcc and icc, the killer is not the increment but the initialization, where each byte is accessed individually. They probably do this by default because the order of the bytes could be different from 0 1 2 3. Surprisingly, clang recognizes the byte order and optimizes this correctly, so there is no perf difference.

Then something interesting happens: the genINC32 function is slower on gcc, but faster on msvc (I don't see a permalink button on rise4fun, so go there and paste the code tested on Ideone). Without seeing the msvc-generated assembly and comparing, I have no explanation for that.

In conclusion: while it is possible for a compiler to correctly optimize all of this code, don't rely on it for now, and don't access every byte individually when you don't need to.

answered Oct 04 '22 02:10 by Synxis