At the bottom of Demystifying the Restrict Keyword is this curious advice:
Due to the order in which scheduling is done in GCC, it is always better to simplify expressions. Do not mix memory access with calculations. The code can be re-written as follows:
then there is an example which is essentially transforming this
velocity_x[i] += acceleration_x[i] * time_step;
into this
const float ax = acceleration_x[i]; // Then the same follows for y, z const float vx = velocity_x[i]; // etc for y, z const float nvx = vx + ( ax * time_step ); // etc velocity_x[i] = nvx; // ...
Really? I would have thought this sort of transformation was trivial compared to other stuff optimising compilers have to do, such as lambda arguments to std::foreach
and so on.
Is this just stale, silly advice? Or is there a good reason why GCC can't or won't do this? (It makes me worry about writing the above as velocity += acceleration * time_step
using my Vector3f
class!
Using -march=native enables all instruction subsets supported by the local machine (hence the result might not run on different machines). Using -mtune=native produces code optimized for the local machine under the constraints of the selected instruction set. A generic CPU with 64-bit extensions.
Turning on optimization flags makes the compiler attempt to improve the performance and/or code size at the expense of compilation time and possibly the ability to debug the program. The compiler performs optimization based on the knowledge it has of the program.
What is Link Time Optimization (LTO) Link Time Optimization is a form of interprocedural optimization that is performed at the time of linking application code. Without LTO, Arm® Compiler for Linux compiles and optimizes each source file independently of one another, then links them to form the executable.
Edit: (I'm removing details about restrict
because it deviates from the actual question being asked and is causing confusion. The OP is assuming restict
is used.)
The transformation in your question is indeed trivial for an optimizing compiler, but that is not what Acton's paper is suggesting.
Here is the transformation done in the paper:
This code...
for (size_t i=0;i<count*stride;i+=stride) { velocity_x[i] += acceleration_x[i] * time_step; velocity_y[i] += acceleration_y[i] * time_step; velocity_z[i] += acceleration_z[i] * time_step; position_x[i] += velocity_x[i] * time_step; position_y[i] += velocity_y[i] * time_step; position_z[i] += velocity_z[i] * time_step; }
... was transformed into this code:
for (size_t i=0;i<count*stride;i+=stride) { const float ax = acceleration_x[i]; const float ay = acceleration_y[i]; const float az = acceleration_z[i]; const float vx = velocity_x[i]; const float vy = velocity_y[i]; const float vz = velocity_z[i]; const float px = position_x[i]; const float py = position_y[i]; const float pz = position_z[i]; const float nvx = vx + ( ax * time_step ); const float nvy = vy + ( ay * time_step ); const float nvz = vz + ( az * time_step ); const float npx = px + ( vx * time_step ); const float npy = py + ( vy * time_step ); const float npz = pz + ( vz * time_step ); velocity_x[i] = nvx; velocity_y[i] = nvy; velocity_z[i] = nvz; position_x[i] = npx; position_y[i] = npy; position_z[i] = npz; }
What is the optimization?
The optimization is not - as suggested - the separation of 1 expression into 3 expressions.
The optimization is the insertion of useful instructions between the instructions that operate on any particular piece of data.
If you follow the data moving from velocity_x[i]
to vx
to nvx
back to velocity_x[i]
, the CPU is doing other work between each of those steps.
Why is this an optimization?
Modern CPUs typically have a pipelined architecture.
Since instructions are executed in phases, the CPU allows multiple instructions to be processed at the same time. However, when an instruction requires the result of another instruction that hasn't been fully executed, this pipeline is stalled. No further instructions are executed until the stalled instruction can run.
Why isn't my optimizing compiler doing this automatically?
Some do.
GCC stands out as being relatively poor with this optimization.
I disassembled both loops above using gcc 4.7 (x86-64 architecture, optimization at -O3). Similar assembly was produced, but the order of the instructions was different and the first version produced significant stalls where a single float would be loaded, changed, and stored within the span of a few instructions.
You can read a little about gcc's instruction scheduling here, or just search the web for gcc instruction scheduling to see a lot of frustrated articles about this issue.
In my opinion, stale/silly advice. I mean that level of detail is specific to the compiler, compiler version, processor, processor version, etc. I'd stick with readability and let the compiler do its job. If someone is that worried about a possible clock cycle or two in a certain target, write some #def assembly for that target and leave the higher-level code there for other targets and reference.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With