How much performance gain, if any, can one get from reordering x64 (x86-64) instructions on recent high-end Intel CPUs? Is it worth bothering with in extremely time-critical situations?
I was also wondering about possible gains from changing register usage / using additional registers (if free) to permit longer-distance code movement in certain odd cases.
Instruction scheduling isn't usually a big deal over short distances, because out-of-order execution usually works. It matters a lot more on in-order CPUs like some ARM cores, where scheduling loads well ahead of instructions that use the result is a big deal.
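For instance, here's a minimal sketch (a hypothetical function, NASM syntax; the names and inputs are made up) of hoisting a load above independent work so the load latency overlaps with it. On an out-of-order x86 core this reordering usually makes no difference, but on an in-order core it can:

```
; x86-64 System V: rdi = pointer, esi/edx = unrelated integer inputs
load_early_sketch:
    mov   eax, [rdi]        ; start the load as early as possible
    imul  esi, edx          ; independent work overlaps the load latency
    add   esi, 42
    add   eax, esi          ; first use of the loaded value, hopefully after it arrives
    ret
```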
It can help some even on high-end x86, though, depending on what kind of bottleneck is limiting your execution throughput. See http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ for some interesting stuff about ROB size vs. number of physical registers being a limiting factor in out-of-order execution. Software-pipelining could help with long dependency chains that out-of-order execution has a hard time hiding.
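As a toy sketch of the software-pipelining transformation itself (a hypothetical loop, NASM syntax, assuming at least 2 elements; not necessarily a case where it's needed), each iteration starts the next iteration's load before finishing the current one's work:

```
; rdi = float src, rsi = float dst, rcx = element count (>= 2)
sw_pipeline_sketch:
    movss  xmm0, [rdi]        ; prologue: load for iteration 0
    sub    rcx, 1
.loop:
    movss  xmm1, [rdi + 4]    ; start the *next* iteration's load early
    mulss  xmm0, xmm0         ; work for the current iteration
    movss  [rsi], xmm0
    movaps xmm0, xmm1         ; rotate the pipelined value
    add    rdi, 4
    add    rsi, 4
    dec    rcx
    jnz    .loop
    mulss  xmm0, xmm0         ; epilogue: finish the last element
    movss  [rsi], xmm0
    ret
```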
Putting instructions from the critical-path dependency chain early in program order can help, because OOO scheduling typically tries to execute oldest-ready-first. (See How are x86 uops scheduled, exactly?).
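A minimal sketch of what that implies (hypothetical dependency chains, NASM syntax): when a critical-path uop and independent filler uops become ready in the same cycle and compete for a port, the one that's older in program order wins, so list the critical-path instruction first:

```
critical_path_first_sketch:
    mulsd  xmm0, xmm1        ; on the critical path: put it first so it wins port conflicts
    mulsd  xmm3, xmm4        ; independent work that competes for the same ports
    mulsd  xmm5, xmm6        ;   (this can wait a cycle without hurting overall latency)
    addsd  xmm0, xmm2        ; next link in the critical path
    ret
```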
Modern CPUs are complex beasts, and sometimes reordering things can make a difference when you wouldn't expect it to matter. Sometimes there's no way to guess exactly why it made a difference. Different ordering can affect front-end bandwidth in the decoders or even in the uop cache, since there are many rules about how decoded uops pack into up-to-6-uop lines in the uop cache (on Intel CPUs). For example, see Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs.
Sometimes the explanation is very obscure. For example, in Intel's optimization manual, Example 3-25. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions, they discuss overwriting the zero-latency-movzx result right away to free up the internal resource sooner. (I tried the examples on Haswell and Skylake, and found that mov-elimination did in fact work significantly more of the time when doing that, but that it was actually slightly slower in total cycles, instead of faster. The example was intended to show the benefit on IvyBridge, which probably bottlenecks on its 3 ALU ports, but HSW/SKL only bottleneck on resource conflicts in the dep chains and don't seem to be bothered by needing an ALU port for more of the movzx instructions.)
Probably this also applies to eliminated `mov` instructions, not just `movzx`, but it might not.
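Here's a sketch of the idea (not the manual's exact code; registers and surrounding work are made up, NASM syntax): overwriting the `movzx` result sooner frees the internal tracking resource earlier, so later `mov`/`movzx` instructions are more likely to be eliminated too.

```
; before: the movzx result stays live across independent work
    movzx  eax, bl
    imul   r8d, r9d           ; independent work, doesn't touch eax/ebx/esi
    imul   r8d, r10d
    add    eax, esi           ; eax is only overwritten here

; after: the instruction that overwrites eax is moved up next to the movzx
    movzx  eax, bl
    add    eax, esi           ; overwrite eax right away, freeing the resource sooner
    imul   r8d, r9d
    imul   r8d, r10d
```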
IDK if I would have figured that out if I'd run into it in a real optimization situation (for IvyBridge) if Intel's manual hadn't used it as an example. Performance counters for uops issued vs. executed (fused domain vs. unfused-domain) show how many mov uops are eliminated, but figuring out why it's happening would be nearly impossible without an optimization manual saying something about why. Reordering nearby independent instructions just to try things out can help as a last step in tuning, but at that point it's voodoo / black magic / guesswork.
As Margaret points out, there are reasons to reorder instructions other than simple scheduling. See Agner Fog's optimization and microarchitecture guides, and other resources in the x86 tag wiki to learn more.
For example, grouping `cmp/jcc` and `test/jcc` together is always a good idea because of macro-fusion. Your compiler will do this for you when you compile with `-march=haswell` or something, because that enables `-mtune=haswell`.
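For example (a hypothetical loop, NASM syntax), the two versions below do the same work, but only the first keeps `cmp` adjacent to `jb` so they can macro-fuse into a single uop:

```
; cmp/jb adjacent: they macro-fuse
loop_fused:
    add    rsi, [rdi]
    add    rdi, 8
    lea    rcx, [rcx + 1]    ; independent counter (lea doesn't write FLAGS)
    cmp    rdi, rdx
    jb     loop_fused

; same instructions, but the lea between cmp and jb defeats macro-fusion
loop_unfused:
    add    rsi, [rdi]
    add    rdi, 8
    cmp    rdi, rdx
    lea    rcx, [rcx + 1]
    jb     loop_unfused
```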
It can also open up other optimization opportunities if it lets you avoid some `mov` instructions or spill/reload, but that goes beyond just scheduling instructions.
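For example (hypothetical helpers, NASM syntax), picking the destination register up front can remove a `mov` entirely, which is more than just reordering the instructions you already have:

```
; result built in ecx, then copied into the return register
with_extra_mov:
    imul   ecx, edi, 10
    add    ecx, esi
    mov    eax, ecx
    ret

; same work with registers chosen so no mov is needed
no_mov_needed:
    imul   eax, edi, 10
    add    eax, esi
    ret
```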