I am in the process of optimizing my code for my n-body simulator, and when profiling my code, have seen this:
These two lines,
float diffX = (pNode->CenterOfMassx - pBody->posX);
float diffY = (pNode->CenterOfMassy - pBody->posY);
Where pNode
is a pointer to a object of type Node
that I have defined, and contains (with other things) 2 floats, CenterOfMassx
and CenterOfMassy
Where pBody
is a pointer to a object of type Body
that I have defined, and contains (with other things) 2 floats, posX
and posY
.
Should take the same amount of time, but do not. In fact the first line accounts for 0.46% of function samples, but the second accounts for 5.20%.
Now I can see the second line has 3 instructions, and the first only has one.
My question is why do these seemingly do the same thing but in practice to different things?
As previously stated, profiler is listing only one assembly instruction with the first line, but three with the second line. However, because the optimizer can move code around a lot, this isn't very meaningful. It looks like the code is optimized to load all of the values into registers first, and then perform the subtractions. So it performs an action from the first line, then an action from the second line (the loads), followed by an action from the first line and an action from the second line (the subtractions). Since this is difficult to represent, it just does a best approximation of which line corresponds to which assembly code when displaying the disassembly inline with the code.
Take note that the first load is executed and may still be in the CPU pipeline when the next load instruction is executing. The second load has no dependence on the registers used in the first load. However, the first subtraction does. This instruction requires the previous load instruction to be far enough in the pipeline that the result can be used as one of the operands of the subtraction. This will likely cause a stall in the CPU while the pipeline lets the load finish.
All of this really reinforces the concept of memory optimizations being more important on modern CPUs than CPU optimizations. If, for example, you had loaded the required values into registers 15 instructions earlier, the subtractions might have occurred much quicker.
Generally the best thing you can do for optimizations is to keep the cache fresh with the memory you are going to using, and making sure it gets updated as soon as possible, and not right before the memory is needed. Beyond that, optimizations are complicated.
Of course, all of this is further complicated by modern CPUs that might look ahead 40-60 instructions for out of order execution.
To optimize this further, you might consider using a library that does vector and matrix operations in an optimized manner. Using one of these libraries, it might be possible to use two vector instructions instead of 4 scalar instructions.
According to the expanded assembly, the instructions are reordered to load pNodes
's data members before performing subtraction with pBody
's data members. The purpose may be to take advantage of memory caching.
Thus, the execution order is not the same as C code anymore. Comparing 1 movss which is accounted for the 1st C statement with 1 movss + 2 subss which are accounted for the 2nd C statement is unfair.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With