While playing around with overclocking and running burn-in tests, I noticed that the AVX-optimized version of LINPACK measured lower multithreaded floating-point throughput with Hyperthreading enabled than with it disabled. This was on an Ivy Bridge i7-3770K. I also noticed that with Hyperthreading disabled, LINPACK produced higher core temperatures, even though I was running the CPU at a lower core voltage. All of this leads me to believe that without Hyperthreading, pipeline utilization is actually higher.
I'm curious: is this just something intrinsic to LINPACK's algorithm that causes pipeline stalls or inefficient allocation under SMT, or does Intel's SMT implementation genuinely have trouble scheduling the pipelines when both threads are issuing wide SIMD instructions? If so, is that something Haswell has solved, or that future Intel architectures will solve? Is AVX-512 prone to the same problem?
Finally, are there any steps I can take when programming with AVX for Intel systems that would avoid inefficient pipeline allocation under SMT?
Hyperthreading shares the out-of-order execution resources between two hardware threads, instead of giving them all to one thread. Normally you'd expect at worst no speedup, if one thread could already keep the pipeline full: either way, the execution units should be chewing through up to 4 uops per clock of instructions that actually need to run.
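To make that concrete, here's a minimal sketch (mine, not part of the original answer) of the kind of loop where a single thread already saturates the FP add port on Ivy Bridge, so a sibling hyperthread has nothing to contribute:

```c
// Minimal sketch: a reduction with enough independent accumulators that one
// thread keeps the AVX add port busy every cycle on Ivy Bridge.
// Assumes n is a multiple of 32; compile with -mavx.
#include <immintrin.h>

float sum_avx(const float *a, long n) {
    __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
    for (long i = 0; i < n; i += 32) {
        // Four independent dependency chains hide the ~3-cycle vaddps latency.
        acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(a + i));
        acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(a + i + 8));
        acc2 = _mm256_add_ps(acc2, _mm256_loadu_ps(a + i + 16));
        acc3 = _mm256_add_ps(acc3, _mm256_loadu_ps(a + i + 24));
    }
    __m256 s = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                             _mm256_add_ps(acc2, acc3));
    __m128 t = _mm_add_ps(_mm256_castps256_ps128(s),
                          _mm256_extractf128_ps(s, 1));
    t = _mm_hadd_ps(t, t);    // horizontal sum of the final vector
    t = _mm_hadd_ps(t, t);
    return _mm_cvtss_f32(t);
}
```

Run one copy of this per physical core and the add port is already saturated; a second copy on the sibling logical core just splits that port's throughput (and the shared ROB and load buffers) between two threads.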
If each thread works on its own chunk of memory, each physical core is then juggling twice as much live data at once. Competitive sharing of the L1 / L2 caches means this can end up being worse than without HT.
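One practical consequence (my illustration, not from the original answer): if you tune block sizes for the cache, the fair share per thread halves when two hyperthreads share a core. The constant and helper name below are invented for this sketch:

```c
// Hypothetical helper: size a per-thread tile so HT siblings sharing one
// core's 32 KiB L1d (Ivy Bridge) don't evict each other's working set.
#include <stddef.h>

enum { L1D_BYTES = 32 * 1024 };  /* per physical core, shared by siblings */

static size_t tile_bytes(int threads_per_core) {
    /* use half the fair share, leaving headroom for stacks and stray lines */
    return (size_t)L1D_BYTES / threads_per_core / 2;
}
```

With HT on, `tile_bytes(2)` gives 8 KiB instead of 16 KiB; code blocked for the full cache per thread will thrash instead.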
Also, some workloads have overhead to parallelize. Only embarrassingly parallel problems (like doing many independent matmuls, rather than parallelizing one big one) have negligible overhead for coordinating the threads.
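The embarrassingly parallel case might look like this sketch (the names `NMAT`, `N`, and `matmul` are mine): each loop iteration is fully independent, so the only coordination cost is the parallel-for itself.

```c
// Sketch: a batch of independent matrix multiplies, one per parallel task.
// Assumes the output matrices C are zero-initialized. Compile with -fopenmp.
#define NMAT 64   /* number of independent problems */
#define N    128  /* matrix dimension */

static void matmul(const float A[N][N], const float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
}

void matmul_batch(const float (*A)[N][N], const float (*B)[N][N],
                  float (*C)[N][N]) {
    #pragma omp parallel for schedule(static)
    for (int m = 0; m < NMAT; m++)
        matmul(A[m], B[m], C[m]);  /* no sharing between iterations */
}
```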
As Agner Fog mentions in his Optimizing Assembly manual, if any of the competitively-shared or partitioned CPU resources are the bottleneck, hyperthreading won't help, and can hurt. It's excellent when code spends a lot of time on branch mispredicts or cache misses, since the other thread can keep the hungry pipeline from sitting idle.
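For contrast, here's a sketch (mine, with invented names) of the kind of code where HT shines: a dependent pointer chase stalls the pipeline on every cache miss, and those idle cycles are exactly what a sibling thread can use.

```c
// Sketch: latency-bound pointer chasing. Each load depends on the previous
// one, so on a big list the core mostly waits on cache/DRAM misses -- dead
// cycles a second hyperthread can fill with its own work.
struct node { struct node *next; long payload; };

long chase(const struct node *p, long steps) {
    long sum = 0;
    for (long i = 0; i < steps; i++) {
        sum += p->payload;  /* work per step is trivial */
        p = p->next;        /* serialized load: likely miss on a large list */
    }
    return sum;
}
```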
Matrix math has access patterns predictable enough that cache misses and branch mispredicts are rare, especially in code that's carefully blocked for cache sizes.
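"Blocked for cache sizes" means tiling the loops so each tile's working set stays resident across its reuse. A standard sketch (`BLOCK = 64` is an assumed tuning value, not from the original answer):

```c
// Sketch: classic cache blocking of C += A * B. Each BLOCK x BLOCK tile of
// B and the matching strip of C stay hot in L1/L2 while they're reused, so
// almost every access hits cache. Assumes row-major n x n matrices, C zeroed.
#define BLOCK 64

void matmul_blocked(int n, const float *A, const float *B, float *C) {
    for (int ii = 0; ii < n; ii += BLOCK)
    for (int kk = 0; kk < n; kk += BLOCK)
    for (int jj = 0; jj < n; jj += BLOCK)
        for (int i = ii; i < ii + BLOCK && i < n; i++)
            for (int k = kk; k < kk + BLOCK && k < n; k++) {
                float a = A[i * (long)n + k];
                for (int j = jj; j < jj + BLOCK && j < n; j++)
                    C[i * (long)n + j] += a * B[k * (long)n + j];
            }
}
```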
How to avoid the case where HT doesn't help: make your code slow, so a single thread can't execute it efficiently enough to keep the pipeline full. >.< Seriously, though: if there's an algorithm with cache misses or branch mispredicts that performs about the same as a brute-force method on a single thread, using it might help. For example, early-out tests might be nearly a wash on a single thread, given the overhead of the branch mispredicts, but could come out well ahead when your code is running on two hardware threads of the same core, where the brute-force way is at a disadvantage.
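A concrete (invented) example of that trade-off: an early-out scan versus a branchless brute-force scan. On one thread, the mispredict cost of the early-out can cancel out the work it skips; with two HT threads per core, the early-out's stall cycles get filled by the sibling thread, while the brute-force version is stuck competing for the shared execution ports.

```c
// Sketch: early-out vs. brute force. Which wins depends on mispredict cost
// and, under HT, on whether the sibling thread can soak up the stall cycles.
int any_negative_early_out(const float *x, int n) {
    for (int i = 0; i < n; i++)
        if (x[i] < 0.0f)      /* data-dependent branch: mispredict risk */
            return 1;         /* but skips the rest of the array */
    return 0;
}

int any_negative_brute_force(const float *x, int n) {
    int found = 0;
    for (int i = 0; i < n; i++)
        found |= (x[i] < 0.0f);  /* branchless, but always scans everything */
    return found;
}
```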