
Why would using 8 threads be faster than 4 threads on a 4 core Hyper Threaded CPU?

I have a quad core i7 920 CPU. It is Hyperthreaded, so the computer thinks it has 8 cores.

From what I've read on the interweb, when doing parallel tasks, I should use the number of physical cores, not the number of hyper threaded cores.

So I have done some timings, and was surprised that using 8 threads in a parallel loop is faster than using 4 threads.

Why is this? My example code is too long to post here, but it can be found (and run) here: https://github.com/jsphon/MTVectorizer

A chart of the performance is here:

[Image: chart of the performance]

Ginger asked Nov 23 '14 10:11



1 Answer

(Intel) hyperthreaded cores act like (up to) two CPUs.

The observation is that a single CPU has a set of resources that are ideally busy continuously, but in practice sit idle surprisingly often while the CPU waits for some external event, typically memory reads or writes.

By adding a bit of additional state information for another hardware thread (e.g., another copy of the registers + additional stuff), the "single" CPU can switch its attention to executing the other thread when the first one blocks. (One can generalize this to N hardware threads, and other architectures have done this; Intel quit at 2.)

If both hardware threads spend much of their time waiting for various events, the CPU can do the processing for one while the other waits. 40 nanoseconds for a memory wait is a long time. So if your program fetches lots of memory, I'd expect it to look as if both hardware threads were fully effective, e.g., you should get nearly 2x.

If the two hardware threads are doing work that is highly local (e.g., intense computations in just the registers), then internal waits become minimal and the single CPU can't switch fast enough to service both hardware threads as fast as they generate work. In this case, performance will degrade. I don't recall where I heard it, and I heard this a long time ago: under such circumstances the net effect is more like 1.3x than the idealized 2x. (Expecting the SO audience to correct me on this).
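A toy model can make these two cases concrete. This is only a sketch under a simplifying assumption of mine, not a description of how the hardware actually schedules work: suppose one thread keeps the core's execution resources occupied for a fraction `busy` of the time and waits for the rest, and suppose a second hardware thread can use exactly the idle slots. The function name `smt_speedup` and its parameters are made up for illustration.

```python
def smt_speedup(busy: float, hw_threads: int = 2) -> float:
    """Toy estimate of the per-core speedup from running hw_threads hardware threads.

    busy is the fraction of time a single thread keeps the core occupied
    (0 < busy <= 1); the remainder is spent waiting, e.g. on memory.
    """
    if not 0.0 < busy <= 1.0:
        raise ValueError("busy must be in (0, 1]")
    # All threads together ask for hw_threads * busy of the core,
    # but the core can deliver at most 1.0 (it is never more than fully busy).
    delivered = min(hw_threads * busy, 1.0)
    # Speedup is relative to one thread, which gets `busy` worth of work done.
    return delivered / busy


for busy in (0.5, 0.77, 1.0):
    print(f"busy={busy:.2f} -> {smt_speedup(busy):.2f}x per core")
# busy=0.50 -> 2.00x per core
# busy=0.77 -> 1.30x per core
# busy=1.00 -> 1.00x per core
```

In this model a thread that waits half the time gives the full 2x, and the 1.3x figure would correspond to a thread that keeps the core busy roughly 77% of the time. Real cores share more than one kind of resource, so treat the numbers as an illustration of the trend only.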

Your application may switch back and forth in its needs depending on which part is running at the moment. Then you will get a mix of performance. I'm happy with any speedup I can get.
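Since the result depends on the workload, the practical approach is to measure it at several thread counts, as the question did. Below is a minimal timing sketch, not the code from the linked repository. The workload (hashing a zero-filled buffer) is a hypothetical stand-in chosen because `hashlib` releases Python's GIL on large buffers, so the threads can genuinely run in parallel; `work` and `time_threads` are illustrative names. It also assumes 2-way hyperthreading is enabled, so that half of `os.cpu_count()` approximates the number of physical cores.

```python
import hashlib
import os
import time
from concurrent.futures import ThreadPoolExecutor


def work(chunk: bytes) -> str:
    # hashlib releases the GIL while hashing large buffers,
    # so several threads can hash at the same time.
    return hashlib.sha256(chunk).hexdigest()


def time_threads(n_threads: int, chunks: list[bytes]) -> float:
    """Return wall-clock seconds needed to hash all chunks with n_threads workers."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(work, chunks))  # consume the iterator so every task finishes
    return time.perf_counter() - start


if __name__ == "__main__":
    logical = os.cpu_count() or 1       # logical CPUs, i.e. hardware threads
    physical_guess = max(1, logical // 2)  # assumption: 2 hardware threads per core
    chunk = bytes(2_000_000)            # stand-in workload: 2 MB of zeros
    chunks = [chunk] * 64               # 64 tasks sharing the same buffer
    for n in sorted({1, physical_guess, logical}):
        print(f"{n} threads: {time_threads(n, chunks):.3f} s")
```

Run it a few times and compare the timings at the physical-core count against the logical-core count; swapping in your own task for `work` shows which regime your code is in.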

Ira Baxter answered Oct 05 '22 23:10
