
Multithreading: What is the point of more threads than cores?

People also ask

Is it better to have more threads or cores?

Cores increase the amount of work that can be done truly in parallel, whereas threads improve throughput by keeping that hardware busy. A core is a physical hardware component, whereas a thread is a virtual unit of execution managed by the scheduler. Multiple threads share a core through context switching, while multiple cores run code simultaneously.

What is the advantage of having more threads?

On a multiprocessor system, multiple threads can concurrently run on multiple CPUs. Therefore, a multithreaded program can run much faster than it would on a uniprocessor system. It can also be faster than a program using multiple processes, because threads require fewer resources and generate less overhead.

Is it better to have more threads or less?

A general rule of thumb is that more physical cores are better than more threads. So a processor with 4 cores and 4 threads would generally be better than one with 2 cores and 4 threads.

Do you need multiple cores for multithreading?

No. A multithreaded program can take advantage of a multicore computer by running on more than one core at the same time, but threads can also share a single core by taking turns on it.


The answer revolves around the purpose of threads, which is parallelism: to run several separate lines of execution at once. In an 'ideal' system, you would have one thread executing per core: no interruption. In reality this isn't the case. Even if you have four cores and four working threads, your process and its threads will constantly be switched out for other processes and threads. If you are running any modern OS, every process has at least one thread, and many have more. All these processes are running at once. You probably have several hundred threads all running on your machine right now. You won't ever get a situation where a thread runs without having time 'stolen' from it. (Well, you might if it's running in real time: if you're using a real-time OS or, even on Windows, a real-time thread priority. But it's rare.)

With that as background, the answer: Yes, more than four threads on a true four-core machine may give you a situation where they 'steal time from each other', but only if each individual thread needs 100% CPU. If a thread is not working 100% (as a UI thread might not be, or a thread doing a small amount of work or waiting on something else) then another thread being scheduled is actually a good situation.

It's actually more complicated than that:

  • What if you have five bits of work that all need to be done at once? It makes more sense to run them all at once than to run four of them and then run the fifth later.

  • It's rare for a thread to genuinely need 100% CPU. The moment it uses disk or network I/O, for example, it may spend time waiting and doing nothing useful. This is a very common situation.

  • If you have work that needs to be run, one common mechanism is to use a threadpool. It might seem to make sense to have the same number of threads as cores, yet the .Net threadpool has up to 250 threads available per processor. I'm not certain why they do this, but my guess is that it has to do with the size of the tasks that are given to run on the threads.

So: stealing time isn't a bad thing (and isn't really theft, either: it's how the system is supposed to work.) Write your multithreaded programs based on the kind of work the threads will do, which may not be CPU-bound. Figure out the number of threads you need based on profiling and measurement. You may find it more useful to think in terms of tasks or jobs, rather than threads: write objects of work and give them to a pool to be run. Finally, unless your program is truly performance-critical, don't worry too much :)
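
For example, here's a minimal Java sketch of that 'tasks given to a pool' idea (the pool size, job count, and 50 ms sleep are arbitrary stand-ins chosen for the example): work is expressed as small jobs handed to a fixed pool, and the pool can reasonably hold more threads than cores because each job spends most of its time blocked rather than computing.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class JobPoolSketch {
    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();

        // For I/O-heavy jobs a pool larger than the core count often helps,
        // because most threads spend their time blocked, not computing.
        // The factor of 4 is an arbitrary starting point to refine by profiling.
        ExecutorService pool = Executors.newFixedThreadPool(cores * 4);

        for (int i = 0; i < 100; i++) {
            final int jobId = i;
            pool.submit(() -> {
                // Placeholder for a mostly-waiting job (network call, disk read, ...).
                try {
                    Thread.sleep(50);  // simulate blocking I/O
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                System.out.println("job " + jobId + " done on " + Thread.currentThread().getName());
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}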


Just because a thread exists doesn't always mean it's actively running. Many applications of threads involve some of the threads going to sleep until it's time for them to do something - for instance, user input triggering threads to wake up, do some processing, and go back to sleep.

Essentially, threads are individual tasks that can operate independently of one another, with no need to be aware of the progress of another task. It's quite possible to have more of these than you have ability to run simultaneously; they're still useful for convenience even if they sometimes have to wait in line behind one another.


The point is that, despite not getting any real speedup when thread count exceeds core count, you can use threads to disentangle pieces of logic that should not have to be interdependent.

In even a moderately complex application, using a single thread to try to do everything quickly makes a hash of the 'flow' of your code. The single thread spends most of its time polling this, checking on that, conditionally calling routines as needed, and it becomes hard to see anything but a morass of minutiae.

Contrast this with the case where you can dedicate threads to tasks so that, looking at any individual thread, you can see what that thread is doing. For instance, one thread might block waiting on input from a socket, parse the stream into messages, filter messages, and when a valid message comes along, pass it off to some other worker thread. The worker thread can work on inputs from a number of other sources. The code for each of these will exhibit a clean, purposeful flow, without having to make explicit checks that there isn't something else to do.

Partitioning the work this way allows your application to rely on the operating system to schedule what to do next with the cpu, so you don't have to make explicit conditional checks everywhere in your application about what might block and what's ready to process.
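
One way that shape might look in Java (the host name, port, and 'non-empty line' filter are placeholders invented for the sketch): a reader thread blocks on the socket, parses and filters what arrives, and hands messages to a worker thread through a blocking queue, so each thread's code reads as a single clean loop.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ReaderWorkerSketch {
    public static void main(String[] args) throws IOException {
        BlockingQueue<String> messages = new LinkedBlockingQueue<>();
        Socket socket = new Socket("example.com", 9000);  // placeholder endpoint

        // Reader thread: its whole job is "read, parse, filter, hand off".
        Thread reader = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {   // blocks until data arrives
                    if (!line.isEmpty()) {                 // stand-in for real filtering
                        messages.offer(line);
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        });

        // Worker thread: consumes messages that any number of producers may enqueue.
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String msg = messages.take();  // sleeps until a message is available
                    System.out.println("handling: " + msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        worker.start();
    }
}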


If a thread is waiting for a resource (such as loading a value from RAM into a register, performing disk I/O or network access, launching a new process, querying a database, or waiting for user input), the processor can work on a different thread and return to the first thread once the resource is available. This reduces the time the CPU spends idle: instead of sitting idle, it can perform millions of operations in the meantime.

Consider a thread that needs to read data off a hard drive. In 2014, a typical processor core operates at 2.5 GHz and may be able to execute 4 instructions per cycle. With a cycle time of 0.4 ns, the processor can execute 10 instructions per nanosecond. Since typical mechanical hard drive seek times are around 10 milliseconds, the processor can execute 100 million instructions in the time it takes to read a value from the hard drive. Hard drives with a small cache (a 4 MB buffer) and hybrid drives with a few GB of storage can improve on this significantly, as data latency for sequential reads or reads from the hybrid section may be several orders of magnitude lower.

A processor core can switch between threads (the cost of pausing and resuming a thread is around 100 clock cycles) while the first thread waits for a high-latency input (anything more expensive than registers (1 clock) or RAM (5 nanoseconds)). Such inputs include disk I/O, network access (latency of around 250 ms), reading data off a CD or a slow bus, or a database call. Having more threads than cores means useful work can be done while high-latency tasks are resolved.
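
A small Java sketch to illustrate the overlap (the 100 ms sleep is just a stand-in for any high-latency operation): eight threads that each "wait" 100 ms finish in roughly 100 ms of wall-clock time rather than 800 ms, even on a machine with fewer than eight cores, because a waiting thread doesn't occupy a core.

public class OverlapSketch {
    public static void main(String[] args) throws InterruptedException {
        int threadCount = 8;  // deliberately more than most machines have cores
        Thread[] threads = new Thread[threadCount];

        long start = System.nanoTime();
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(() -> {
                try {
                    Thread.sleep(100);  // stand-in for disk, network, or database latency
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // Expected: roughly 100 ms, not 8 * 100 ms, because the waits overlap.
        System.out.println(threadCount + " waits finished in ~" + elapsedMs + " ms");
    }
}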

The operating system's thread scheduler assigns a priority to each thread, and allows a thread to sleep, then resume after a predetermined time. It is the thread scheduler's job to reduce thrashing, which would occur if each thread executed just 100 instructions before being put to sleep again: the overhead of switching threads would reduce the total useful throughput of the processor core.

For this reason, you may want to break up your problem into a reasonable number of threads. If you were writing code to perform matrix multiplication, creating one thread per cell in the output matrix might be excessive, whereas one thread per row or per n rows in the output matrix might reduce the overhead cost of creating, pausing, and resuming threads.
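
Here's a rough Java sketch of that row-wise partitioning (the matrix size and pool size are arbitrary choices for the example): each task computes a band of output rows, so the number of tasks stays close to the core count instead of being one thread per cell.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RowBlockMatMul {
    public static void main(String[] args) throws Exception {
        int n = 512;                       // arbitrary square-matrix size
        double[][] a = new double[n][n];
        double[][] b = new double[n][n];
        double[][] c = new double[n][n];

        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        int rowsPerTask = Math.max(1, n / cores);   // one band of rows per task, not one thread per cell

        List<Future<?>> tasks = new ArrayList<>();
        for (int startRow = 0; startRow < n; startRow += rowsPerTask) {
            final int lo = startRow;
            final int hi = Math.min(n, startRow + rowsPerTask);
            tasks.add(pool.submit(() -> {
                for (int i = lo; i < hi; i++) {
                    for (int k = 0; k < n; k++) {
                        for (int j = 0; j < n; j++) {
                            c[i][j] += a[i][k] * b[k][j];   // rows of c are disjoint, so no locking needed
                        }
                    }
                }
            }));
        }
        for (Future<?> t : tasks) {
            t.get();   // wait for every band to finish
        }
        pool.shutdown();
    }
}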

This is also why branch prediction is important. If an if statement's condition requires loading a value from RAM but the bodies of the if and else branches use values already loaded into registers, the processor may speculatively execute one or both branches before the condition has been evaluated. Once the condition is known, the processor keeps the result of the correct branch and discards the other. Performing potentially useless work here is probably better than switching to a different thread, which could lead to thrashing.

As we have moved away from high clock-speed single-core processors to multi-core processors, chip design has focused on cramming more cores per die, improving on-chip resource sharing between cores, better branch prediction algorithms, better thread switching overhead, and better thread scheduling.


Most of the answers above talk about performance and simultaneous operation. I'm going to approach this from a different angle.

Let's take the case of, say, a simplistic terminal emulation program. You have to do the following things:

  • watch for incoming characters from the remote system and display them
  • watch for stuff coming from the keyboard and send them to the remote system

(Real terminal emulators do more, including potentially echoing the stuff you type onto the display as well, but we'll pass over that for now.)

Now the loop for reading from the remote is simple, as per the following pseudocode:

while get-character-from-remote:
    print-to-screen character

The loop for monitoring the keyboard and sending is also simple:

while get-character-from-keyboard:
    send-to-remote character

The problem, though, is that you have to do this simultaneously. The code now has to look more like this if you don't have threading:

loop:
    check-for-remote-character
    if remote-character-is-ready:
        print-to-screen character
    check-for-keyboard-entry
    if keyboard-is-ready:
        send-to-remote character

The logic, even in this deliberately simplified example that doesn't take into account real-world complexity of communications, is quite obfuscated. With threading, however, even on a single core, the two pseudocode loops can exist independently without interlacing their logic. Since both threads will be mostly I/O-bound, they don't put a heavy load on the CPU, even though they are, strictly speaking, more wasteful of CPU resources than the integrated loop would be.
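
A rough Java version of the threaded variant might look like the following (the host, port, and use of raw byte streams are placeholders standing in for the pseudocode's get-character calls): each loop lives in its own thread and simply blocks when there is nothing to do.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

public class TwoLoopTerminalSketch {
    public static void main(String[] args) throws IOException {
        Socket remote = new Socket("example.com", 23);  // placeholder remote system

        // Loop 1: remote -> screen, blocking on the socket when nothing has arrived.
        Thread remoteToScreen = new Thread(() -> {
            try {
                InputStream in = remote.getInputStream();
                int ch;
                while ((ch = in.read()) != -1) {
                    System.out.print((char) ch);   // print-to-screen character
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        });

        // Loop 2: keyboard -> remote, blocking on stdin when nothing has been typed.
        Thread keyboardToRemote = new Thread(() -> {
            try {
                OutputStream out = remote.getOutputStream();
                int ch;
                while ((ch = System.in.read()) != -1) {
                    out.write(ch);                 // send-to-remote character
                    out.flush();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        });

        remoteToScreen.start();
        keyboardToRemote.start();
    }
}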

Now of course real-world usage is more complicated than the above. But the complexity of the integrated loop goes up exponentially as you add more concerns to the application. The logic gets ever more fragmented and you have to start using techniques like state machines, coroutines, and the like to keep things manageable. Manageable, but not readable. Threading keeps the code more readable.

So why would you not use threading?

Well, if your tasks are CPU-bound instead of I/O-bound, threading actually slows your system down. Performance will suffer. A lot, in many cases. ("Thrashing" is a common problem if you pile on too many CPU-bound threads. You wind up spending more time switching between the active threads than you do running the contents of the threads themselves.) Also, one of the reasons the logic above is so simple is that I've very deliberately chosen a simplistic (and unrealistic) example. If you wanted to echo what was typed to the screen then you've got a new world of hurt as you introduce locking of shared resources. With only one shared resource this isn't so much a problem, but it does start to become a bigger and bigger problem as you have more resources to share.
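
As a small taste of what that locking looks like, here is a minimal Java sketch (the shared 'screen' and the two demo threads are invented for the example): once two threads can write to the display, every write has to go through the same lock.

public class SharedScreenSketch {
    // The screen is now shared: both the remote-reader thread and the keyboard-echo
    // thread write to it, so every access has to go through the same lock.
    private static final Object screenLock = new Object();

    static void printToScreen(String s) {
        synchronized (screenLock) {
            System.out.print(s);   // only one thread at a time may touch the display
        }
    }

    public static void main(String[] args) {
        // Two threads standing in for "characters from the remote" and "echo of typed keys".
        new Thread(() -> printToScreen("from remote\n")).start();
        new Thread(() -> printToScreen("echo of keystroke\n")).start();
    }
}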

So in the end, threading is about many things. For example, it's about making I/O-bound processes more responsive (even if less efficient overall) as some have already said. It's also about making logic easier to follow (but only if you minimize shared state). It's about a lot of stuff, and you have to decide if its advantages outweigh its disadvantages on a case by case basis.