Why would a fully CPU bound process work better with hyperthreading?

Given:

A fully CPU-bound, very large job (i.e. far more than a few CPU cycles of work), and
A CPU with 4 physical cores and 8 logical cores in total,

is it possible that 8, 16 and 28 threads perform better than 4 threads? My understanding is that 4 threads would need fewer context switches and would have less overhead in every sense than 8, 16 or 28 threads on a machine with 4 physical cores. However, the timings are:

Threads    Time Taken (in seconds)
4          78.82
8          48.58
16         51.35
28         52.10

The code used to get the timings is given in the Original Question section below. The CPU specifications are also given at the bottom.

After reading the answers that various users have provided and information given in the comments, I am able to finally boil down the question to what I wrote above. If the question above gives you the complete context, you can skip the original question below.

Original Question

What does it mean when we say

Hyper-threading works by duplicating certain sections of the processor—those that store the architectural state—but not duplicating the main execution resources. This allows a hyper-threading processor to appear as the usual "physical" processor and an extra "logical" processor to the host operating system

A question asked on SO today basically tests the performance of multiple threads doing the same work. It has the following code:

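    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Threading;

    class Program
    {
        private static void Main(string[] args)
        {
            // First argument: number of threads (defaults to the logical processor count).
            int threadCount;
            if (args == null || args.Length < 1 || !int.TryParse(args[0], out threadCount))
                threadCount = Environment.ProcessorCount;

            // Second argument: load multiplier (defaults to 1).
            int load;
            if (args == null || args.Length < 2 || !int.TryParse(args[1], out load))
                load = 1;

            Console.WriteLine("ThreadCount:{0} Load:{1}", threadCount, load);
            List<Thread> threads = new List<Thread>();
            for (int i = 0; i < threadCount; i++)
            {
                int i1 = i; // copy, so that each lambda captures its own value
                threads.Add(new Thread(() => DoWork(i1, threadCount, load)));
            }

            // Time from starting the first thread until the last one has finished.
            var timer = Stopwatch.StartNew();
            foreach (var thread in threads) thread.Start();
            foreach (var thread in threads) thread.Join();
            timer.Stop();

            Console.WriteLine("Time:{0} seconds", timer.ElapsedMilliseconds/1000.0);
        }

        // The total work is split evenly between the threads.
        static void DoWork(int seed, int threadCount, int load)
        {
            var mtx = new double[3,3];
            for (var i = 0; i < ((10000000 * load)/threadCount); i++)
            {
                mtx = new double[3,3];
                for (int k = 0; k < 3; k++)
                    for (int l = 0; l < 3; l++)
                        // The snippet as posted used an undeclared `j` here; `i` is substituted so that it compiles.
                        mtx[k, l] = Math.Sin(i + (k*3) + l + seed);
            }
        }
    }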

(I have cut out a few braces to bring the code in a single page for quick readability.)

I ran this code on my machine to replicate the issue. My machine has 4 physical cores and 8 logical ones. The method DoWork() in the code above is completely CPU bound. I expected that hyper-threading could contribute maybe a 30% speedup (because with 4 threads we already have as many CPU-bound threads as physical cores). But it attains nearly a 64% performance gain. When I ran this code with 4 threads, it took about 82 seconds, and when I ran it with 8, 16 and 28 threads, it took about 50 seconds in every case.

To summarize the timings:

Threads    Time Taken (in seconds)
4          78.82
8          48.58
16         51.35
28         52.10
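As a rough check of these numbers, here is a minimal sketch that computes the speedup of each run relative to the 4-thread run. The class and method names (`SpeedupTable`, `Speedup`) are made up for this illustration; only the timings come from the table.

```csharp
using System;

public static class SpeedupTable
{
    // Speedup of a run relative to a baseline: baselineSeconds / seconds.
    public static double Speedup(double baselineSeconds, double seconds)
    {
        return baselineSeconds / seconds;
    }

    public static void Main()
    {
        // Timings copied from the table (thread count, seconds).
        int[] threads = { 4, 8, 16, 28 };
        double[] seconds = { 78.82, 48.58, 51.35, 52.10 };

        for (int i = 0; i < threads.Length; i++)
        {
            // The 4-thread run (index 0) is the baseline.
            double s = Speedup(seconds[0], seconds[i]);
            Console.WriteLine("{0,2} threads: {1:F2}x", threads[i], s);
        }
    }
}
```

With the table values, the 8-thread run comes out at roughly 1.62x, i.e. about 62% more throughput. The 64% figure mentioned in the text corresponds to the rounded timings of 82 and 50 seconds (82 / 50 = 1.64).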

I could see that CPU usage was ~50% with 4 threads. Shouldn't it be ~100%? After all my processor has only 4 physical cores. And the CPU usage was ~100% for 8 and 16 threads.

If somebody can explain the quoted text at the start, I hope to understand hyper-threading better and, in turn, to get the answer to: why would a fully CPU-bound process work better with hyper-threading?

For the sake of completeness:

I have an Intel Core i7-4770 CPU @ 3.40 GHz, 3401 MHz, 4 Core(s), 8 Logical Processor(s).
I ran the code in Release mode.
I know that the way the timings are measured is bad: it only gives the time of the slowest thread. I took the code as it is from the other question. However, what is the justification for 50% CPU usage when running 4 CPU-bound threads on a machine with 4 physical cores?

asked Sep 11 '15 19:09

displayName

1 Answer

CPU pipeline

Each instruction has to go through several steps in the pipeline to be fully executed. At the very least, it must be decoded, sent to an execution unit, and then actually executed there. Modern CPUs have several execution units, and they can execute instructions completely in parallel. The execution units are not interchangeable, though: some operations can only be done on one particular unit. For example, memory loads are usually handled by one or two dedicated units, memory stores go exclusively to another unit, and calculations are done by yet other units.

Knowing about the pipeline, we may wonder: how can a CPU work so fast if we write purely sequential code and each instruction has to go through so many pipeline stages? The answer is that the processor executes instructions out of order. It has a large reorder buffer (e.g. for 200 instructions), and it pushes many instructions through its pipeline in parallel. If at some moment an instruction cannot be executed for any reason (it waits for data from slow memory, it depends on another instruction that has not finished yet, and so on), it is delayed for some cycles. During this time the processor executes newer instructions that come after the delayed one in the code, provided that they do not depend on the delayed instruction in any way.

Now we can see the problem of latency. Even if an instruction is decoded and all of its inputs are already available, it still takes several cycles to execute completely. This delay is called instruction latency. Meanwhile, the processor can execute many other independent instructions, if there are any.
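To make the role of independent instructions more concrete, here is a small sketch (not taken from the question's code) that sums the same array in two ways. The names `LatencyDemo`, `SumOneChain` and `SumFourChains` are hypothetical. The first loop forms one long dependency chain, because every addition needs the result of the previous one. The second loop keeps four separate accumulators, so an out-of-order core has independent additions it may overlap.

```csharp
using System;
using System.Diagnostics;

public static class LatencyDemo
{
    // One accumulator: every addition depends on the previous one,
    // so the additions form a single dependency chain.
    public static double SumOneChain(double[] data)
    {
        double sum = 0.0;
        for (int i = 0; i < data.Length; i++)
            sum += data[i];
        return sum;
    }

    // Four accumulators: the four chains do not depend on each other,
    // so the core is free to overlap their additions.
    public static double SumFourChains(double[] data)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i = 0;
        for (; i + 3 < data.Length; i += 4)
        {
            s0 += data[i];
            s1 += data[i + 1];
            s2 += data[i + 2];
            s3 += data[i + 3];
        }
        for (; i < data.Length; i++) // leftover elements
            s0 += data[i];
        return (s0 + s1) + (s2 + s3);
    }

    public static void Main()
    {
        // Small integer values keep both sums exact, so the results match.
        var data = new double[1 << 20];
        for (int i = 0; i < data.Length; i++)
            data[i] = i % 7;

        var sw = Stopwatch.StartNew();
        double a = 0.0;
        for (int rep = 0; rep < 200; rep++) a = SumOneChain(data);
        long oneChainMs = sw.ElapsedMilliseconds;

        sw.Restart();
        double b = 0.0;
        for (int rep = 0; rep < 200; rep++) b = SumFourChains(data);
        long fourChainsMs = sw.ElapsedMilliseconds;

        Console.WriteLine("one chain:   {0} ms (sum {1})", oneChainMs, a);
        Console.WriteLine("four chains: {0} ms (sum {1})", fourChainsMs, b);
    }
}
```

How big the difference is (if any) depends on the JIT and on the hardware, so treat the timings as an experiment to run rather than a guaranteed result. The point is only that the second version gives the processor more independent work to choose from.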

If an instruction loads data from the L2 cache, it has to wait about 10 cycles for the data to arrive. If the data is located only in RAM, it takes hundreds of cycles to load it into the processor. In this case we say that the instruction has high latency. For maximum performance it is important to have other independent operations to execute in the meantime. This is sometimes called latency hiding.

In the end, we have to admit that most real code is sequential in nature. It has some independent instructions that can execute in parallel, but not too many. Having no instructions to execute causes pipeline bubbles, which leads to inefficient use of the processor's transistors. On the other hand, instructions of two different threads are automatically independent in almost all cases. This leads us directly to the idea of hyper-threading.

P.S. You might want to read Agner Fog's manual to better understand internals of modern CPUs.

Hyper-threading

When two threads run in hyper-threading mode on a single core, the processor can interleave their instructions, filling bubbles from the first thread with instructions of the second thread. This makes better use of the processor's resources, especially for ordinary programs. Note that HT may help not only when you have a lot of memory accesses, but also in heavily sequential code. Well-optimized computational code may fully utilize all resources of the CPU, in which case you will see no benefit from HT (e.g. the dgemm routine from a well-optimized BLAS).

P.S. You might want to read Intel's detailed explanation of hyper-threading, including info about which resources are duplicated or shared, and discussion about performance.

Context switches

The context is the internal state of the CPU, which includes at least all the registers. When the executing thread changes, the OS has to do a context switch (detailed description here). According to this answer, a context switch takes about 10 microseconds, while the scheduler's time quantum is 10 milliseconds or more (see here). So context switches do not affect the total time much, because they happen seldom enough. Note that competition between threads for the CPU caches can increase the effective cost of switches in some cases.
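The estimate above can be written out as a tiny calculation. This is only a sketch under a simplifying assumption: exactly one switch happens per time quantum, and cache effects are ignored. `ContextSwitchCost` and `OverheadFraction` are names invented for this example.

```csharp
using System;

public static class ContextSwitchCost
{
    // Fraction of time spent switching when one switch of cost
    // switchSeconds happens per scheduler quantum of quantumSeconds.
    public static double OverheadFraction(double switchSeconds, double quantumSeconds)
    {
        return switchSeconds / (quantumSeconds + switchSeconds);
    }

    public static void Main()
    {
        double switchSeconds = 10e-6;  // about 10 microseconds, as quoted above
        double quantumSeconds = 10e-3; // about 10 milliseconds, as quoted above
        double fraction = OverheadFraction(switchSeconds, quantumSeconds);
        Console.WriteLine("Overhead: {0:F3}% of the run time", fraction * 100.0);
    }
}
```

With these two figures the overhead is roughly 0.1% of the run time, which is far too small to explain a difference of tens of seconds between runs.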

However, with hyper-threading each core holds two states internally: it has two sets of registers, while the caches and the single set of execution units are shared. As a result, the OS does not need to do any context switches when you run 8 threads on 4 physical cores. When you run 16 threads on a quad-core, context switches are performed, but they take a small part of the overall time, as explained above.

Process manager

As for the CPU utilization that you see in the process manager, it does not measure the internals of the CPU pipeline. Windows can only notice when a thread returns execution to the OS in order to sleep, wait for a mutex, wait for the HDD, or do other slow things. As a result, it considers a core fully used if there is a thread working on it that does not sleep or wait for anything. For instance, you can check that running an endless loop while (true) {} leads to full utilization of the core it runs on.
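This also suggests where the ~50% figure for 4 threads may come from. The sketch below assumes that the reported number is simply the share of logical processors that have a busy thread on them, and that each busy thread occupies one logical processor full time. Both are simplifying assumptions, and `ReportedUtilization.Estimate` is a hypothetical helper, not a Windows API.

```csharp
using System;

public static class ReportedUtilization
{
    // Share of logical processors that have a busy thread on them.
    // Assumes each busy thread keeps one logical processor busy full time.
    public static double Estimate(int busyThreads, int logicalProcessors)
    {
        int occupied = Math.Min(busyThreads, logicalProcessors);
        return (double)occupied / logicalProcessors;
    }

    public static void Main()
    {
        int logical = 8; // the i7-4770 from the question: 4 cores, 8 logical processors
        foreach (int threads in new[] { 4, 8, 16, 28 })
            Console.WriteLine("{0,2} threads -> about {1:F0}% reported",
                threads, Estimate(threads, logical) * 100.0);
    }
}
```

Under these assumptions, 4 busy threads on 8 logical processors give 4 / 8 = 50%, and 8 or more threads give 100%, which matches what the question reports.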

answered Oct 15 '22 02:10

stgatilov
