Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is HyperThreading / SMT a flawed concept?

The primary idea behind HT/SMT was that when one thread stalls, another thread on the same core can co-opt the rest of that core's idle time and run with it, transparently.

In 2013 Intel dropped SMT in favor of out-of-order execution for its Silvermont processor cores, as they found this gave better performance.

ARM no longer support SMT (for energy reasons). AMD never supported it. In the wild, we still have various processors that support it.

From my perspective, if data and algorithms are built to avoid cache misses and subsequent processing stalls at all costs, surely HT is a redundant factor in multi-core systems? While I appreciate that there is low overhead to the context-switching involved since the two HyperThreads' discrete hardware exists within the same physical core, I cannot see that this is better than no context switching at all.

I'm suggesting that any need for HyperThreading points to flawed software design. Is there anything I am missing here?

like image 521
Engineer Avatar asked Apr 15 '14 08:04

Engineer


2 Answers

  1. Not all programmers have enough knowledge, time and many other things to write efficient, cache-friendly programs. Most of the time only the critical parts are optimized when needed. The other parts may have lots of cache misses

  2. Even if the program was written with cache efficiency in mind, it may not eliminate cache misses completely. Cache availability is a dynamic information only known at runtime, and neither the programmer nor the compiler knows that to optimize memory access.

    • Cache unpredictability is one of the reasons the Itanium failed, because while the compilers can reorder arithmetic operations, it cannot guess those cache information in a multithreading environment to reorder memory loads/stores efficiently.
    • Each time there's a cache miss, hundreds of cycles are wasted which can be useful for other purposes. Some CPUs do out-of-order execution (OoO). But even OoO execution has its limits and you'll be blocked at some point. During those time while waiting for all memory problems to be solved, you can switch to another CPU thread and continue running.
  3. As Peter Cordes said, there are other unavoidable stalls like branch misprediction or simply low instruction-level parallelism where OoO doesn't help. There's no way to solve them before runtime

  4. It's not only Intel that uses SMT now. AMD Bulldozer has module multithreading which is a partial SMT. There are still lots of other architectures that use SMT such as SPARC, MIPS, PowerPC... There are even CPUs with 8 or 16 threads per core, like with 12-core 96-thread POWER8 CPUs or the SPARC T3

    AMD Module Multithreading

    https://en.wikipedia.org/wiki/Simultaneous_multithreading#Modern_commercial_implementations

Update:

AMD has moved to full SMT now in the Zen microarchitecture

zen SMT

like image 98
phuclv Avatar answered Oct 14 '22 23:10

phuclv


Whether hyper-threading helps and by how much very much depends on what the threads are doing. It isn't just about doing work in one thread while the other thread waits on I/O or a cache miss - although that is a big part of the rationale. It is about efficiently using the CPU resources to increase total system throughput. Suppose you have two threads

  1. one has lots of data cache misses (poor spatial locality) and does not use floating point, the poor spatial locality is not necessarily because the programmer didn't do a good job, some workloads are inherently so.
  2. another thread is streaming data from memory and doing floating point calculations

With hyper-threading these two threads can share the same CPU, one is doing integer operations and getting cache misses and stalling, the other is using the floating point unit and the data prefetcher is well ahead anticipating the sequential data from memory. The system throughput is better than if the O/S alternatively scheduled both threads on the same CPU core.

Intel chose not to include hyper-threading in Silvermont, but that doesn't mean it will do away with it in high end Xeon server processors, or even in processors targeted at laptops. Choosing the micro-architecture for a processor involves trade-offs, there are many considerations:

  1. What is the target market (what kind of applications will run)?
  2. What is the target transistor technology?
  3. What is the performance target?
  4. What is the power budget?
  5. What is the target die size (affects yield)?
  6. Where does it fit in the spectrum of price/performance for the company's future products?
  7. What is the target launch date?
  8. How many resources are available to implement and verify the design? Adding micro-architectural features adds complexity which is not linear, there are subtle interactions with other features and the goal is to identify as many bugs as possible before the first "tapeout" to minimize how many "steppings" have to be done before you have a working chip.

Silvermont's die size budget per core and power budget precluded having both out-of-order execution and hyperthreading, and out-of-order execution gives better single threaded performance. Here's Anandtech's assessment:

If I had to describe Intel’s design philosophy with Silvermont it would be sensible scaling. We’ve seen this from Apple with Swift, and from Qualcomm with the Krait 200 to Krait 300 transition. Remember the design rule put in place back with the original Atom: for every 2% increase in performance, the Atom architects could at most increase power by 1%. In other words, performance can go up, but performance per watt cannot go down. Silvermont maintains that design philosophy, and I think I have some idea of how.

Previous versions of Atom used Hyper Threading to get good utilization of execution resources. Hyper Threading had a power penalty associated with it, but the performance uplift was enough to justify it. At 22nm, Intel had enough die area (thanks to transistor scaling) to just add in more cores rather than rely on HT for better threaded performance so Hyper Threading was out. The power savings Intel got from getting rid of Hyper Threading were then allocated to making Silvermont an out-of-order design, which in turn helped drive up efficient use of the execution resources without HT. It turns out that at 22nm the die area Intel would’ve spent on enabling HT was roughly the same as Silvermont’s re-order buffer and OoO logic, so there wasn’t even an area penalty for the move.

like image 10
amdn Avatar answered Oct 14 '22 23:10

amdn