In my program code there are various fairly small objects, ranging from a byte or two up to about 16 bytes, e.g. Vector2 (2 * T), Vector3 (3 * T), Vector4 (4 * T), ColourI32 (4), LightValue16 (2), Tile (2), etc. (size in bytes in brackets).
I was doing some sample-based profiling, which led me to some slower-than-expected functions, e.g.
//4 bits per channel natural light and artificial RGB
class LightValue16
{
...
explicit LightValue16(uint16_t value);
LightValue16(const LightValueF &);
LightValue16(int r, int g, int b, int natural);
int natural()const;
void natural(int v);
int artificialRed()const;
...
uint16_t data;
};
...
LightValue16 World::getLight(const Vector3I &pos)
{ ... }
This function does some maths to look up the value via a couple of arrays, with some default values for positions above the populated part of the world. The contents are inlined nicely, and the disassembly looks about as good as it can get, with about 100 instructions. However, one thing stood out: all the return sites were implemented with something like:
mov eax, dword ptr [ebp + 8]
mov cx, word ptr [ecx + edx * 2] ; or say mov ecx, Fh
mov word ptr [eax], cx
pop ebp
ret 10h
For x64 I saw pretty much the same thing. I didn't check my GCC build, but I suspect it does pretty much the same thing.
I did a little experimenting and found that using a uint16_t return type actually resulted in the World::getLight function getting inlined (it looked like pretty much the same core 80 instructions or so, with no cheats such as the conditionals/loops being different), and the total CPU usage for the outer function I was investigating went from 16.87% to 14.04%. While I can do that on a case-by-case basis (along with trying the force-inline stuff, I suppose), are there any practical ways to avoid such performance issues to start with? Perhaps even to get a couple of percent faster across the entire code?
The best I can think of just now is to use the primitive types in such cases (objects smaller than 4, or perhaps 8, bytes) and move all the current member stuff into non-member functions, so more like how it is done in C, just with namespaces.
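As a rough sketch of that idea (not the actual LightValue16 implementation), the packed value could be a plain uint16_t with free functions in a namespace. The namespace name light16, the alias Value, the helper withNatural and the bit layout are all assumptions made for illustration; the original only says there are 4 bits per channel.

```cpp
#include <cstdint>

namespace light16 {  // hypothetical namespace name

using Value = std::uint16_t;  // plain primitive, returned in a register

// Assumed layout (not from the original code):
// bits 12-15 natural, 8-11 red, 4-7 green, 0-3 blue.
constexpr Value make(int r, int g, int b, int natural) {
    return static_cast<Value>(((natural & 0xF) << 12) | ((r & 0xF) << 8) |
                              ((g & 0xF) << 4) | (b & 0xF));
}

// Non-member accessors replacing the member functions.
constexpr int natural(Value v) { return (v >> 12) & 0xF; }
constexpr int artificialRed(Value v) { return (v >> 8) & 0xF; }

// Returns a copy of v with the natural channel replaced by n.
constexpr Value withNatural(Value v, int n) {
    return static_cast<Value>((v & 0x0FFF) | ((n & 0xF) << 12));
}

}  // namespace light16
```

The trade-off is that a bare uint16_t loses type safety: nothing stops a Tile value being passed where a light value is expected.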
Thinking about this, I guess there is also often a cost to stuff like "t foo(const Vector3F &p)" over "t foo(float x, float y, float z)"? And if so, in a program that uses const& extensively, could it add up to a significant difference?
Take a look at the Itanium C++ ABI. While your computer definitely has no Itanium processor, gcc models its x86 and x86-64 ABIs very closely on the Itanium ABI. The linked section states that
However, if the return value type has a non-trivial copy constructor or destructor, [return into caller-provided memory happens]
To find out what "non-trivial copy constructor or destructor" means, take a look at "What are Aggregates and PODs and how/why are they special?" and peek at the rules for a class to be "trivially copyable". In your case, the problem is the copy constructor you defined. It should not be needed at all: the compiler will synthesize a copy constructor that just copies the data member as needed. If you want to explicitly state that you want a copy constructor, and you are using C++11, you can also write it as a defaulted function, which does not make it non-trivial:
LightValue16(const LightValue16 & other) = default;
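One way to check the difference at compile time is std::is_trivially_copyable from <type_traits> (C++11). The two structs below are hypothetical stand-ins for LightValue16, not the original class. Note that the trait only tests the language-level property; whether a given compiler then returns the object in a register still depends on that platform's ABI rules.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in with a user-provided copy constructor.
// Writing the body by hand makes the copy constructor non-trivial.
struct LightUserCopy {
    std::uint16_t data;
    explicit LightUserCopy(std::uint16_t v) : data(v) {}
    LightUserCopy(const LightUserCopy& other) : data(other.data) {}
};

// Same layout, but the copy constructor is defaulted on its first
// declaration, so it stays trivial.
struct LightDefaultCopy {
    std::uint16_t data;
    explicit LightDefaultCopy(std::uint16_t v) : data(v) {}
    LightDefaultCopy(const LightDefaultCopy&) = default;
};

static_assert(!std::is_trivially_copyable<LightUserCopy>::value,
              "user-provided copy constructor is non-trivial");
static_assert(std::is_trivially_copyable<LightDefaultCopy>::value,
              "defaulted copy constructor keeps the type trivially copyable");
```

Adding a similar static_assert next to each small value type is a cheap way to catch a hand-written copy constructor or destructor creeping in later.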