I'm trying to run a simple experiment to find out the right size of a thread pool when you have a bunch of CPU-intensive tasks.
I already know that this size should equal the number of cores on the machine, but I want to prove that empirically. Here is the code:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Main {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        List<Future<?>> futures = new ArrayList<>();
        ExecutorService threadPool = Executors.newFixedThreadPool(4);
        long startTime = System.currentTimeMillis();
        for (int i = 0; i < 100; i++) {
            futures.add(threadPool.submit(new CpuBoundTask()));
        }
        for (int i = 0; i < futures.size(); i++) {
            futures.get(i).get();
        }
        long endTime = System.currentTimeMillis();
        System.out.println("Time = " + (endTime - startTime));
        threadPool.shutdown();
    }

    static class CpuBoundTask implements Runnable {
        @Override
        public void run() {
            int a = 0;
            for (int i = 0; i < 90000000; i++) {
                a = (int) (a + Math.tan(a));
            }
        }
    }
}
Each task executes in about 700 milliseconds (I think that's long enough to be preempted by the thread scheduler at least once).
I'm running this on a MacBook Pro (2017), 3.1 GHz Intel Core i5, 2 physical cores with hyper-threading enabled, so 4 logical CPUs.
I adjusted the size of the thread pool and ran this program multiple times (averaging the timings). Here are the results:
1 thread = 57 seconds
2 threads = 29 seconds
4 threads = 18 seconds
8 threads = 18.1 seconds
16 threads = 18.2 seconds
32 threads = 17.8 seconds
64 threads = 18.2 seconds
I was expecting the execution time to be significantly higher once I added that many threads (more than the number of CPU cores), because of the context-switch overhead, but that doesn't seem to happen.
I used VisualVM to monitor the program, and it looks like all the threads are created and in the running state, as expected. The CPU also seems to be used properly (close to 95%).
Is there something that I'm missing?
As a side note, in this case you should use System.nanoTime() instead of System.currentTimeMillis(): nanoTime() reads a monotonic clock, so the measurement is not affected by wall-clock adjustments during the run.
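A minimal sketch of the measurement pattern (the busy-work loop is just a placeholder for the benchmark body):

```java
public class NanoTiming {
    public static void main(String[] args) {
        // System.nanoTime() is monotonic: safe for elapsed-time measurement
        // even if the system wall clock is adjusted mid-run.
        long start = System.nanoTime();

        double a = 0;                          // placeholder busy work
        for (int i = 0; i < 1_000_000; i++) {
            a += Math.tan(a);
        }

        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Time = " + elapsedMs + " ms");
    }
}
```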
Your algorithm stops scaling at 4 threads. For simplicity's sake, let us assume that all the threads did the same number of tasks, hence 25 per thread; each thread then took roughly 18 seconds to compute its 25 iterations.
In a very simplistic view, when you run with 64 threads you have 16 threads per logical CPU. During the first 4 iterations there are 4 threads running (1 per core) in parallel, while the other 60 threads are idle, waiting for CPU resources to compute their iterations. So you have something like:
Iteration 0 : Thread 1 (running)
Iteration 1 : Thread 2 (running)
Iteration 2 : Thread 3 (running)
Iteration 3 : Thread 4 (running)
Iteration 4 : Thread 5 (waiting)
Iteration 5 : Thread 6 (waiting)
Iteration 6 : Thread 7 (waiting)
Iteration 7 : Thread 8 (waiting)
...
Iteration 63 : Thread 64 (waiting)
When those 4 threads finish their iterations, they will each get another iteration. In the meantime, let us say that threads 5 to 8 start working on the next four iterations (again 4 threads performing work in parallel) while the other threads are blocked waiting for CPU, and so on. So you always have 4 threads running in parallel, regardless of the pool size, and that is why for:
8 threads = 18.1 seconds
16 threads = 18.2 seconds
32 threads = 17.8 seconds
64 threads = 18.2 seconds
you get roughly the same execution time: approximately the time it took 4 threads to finish 25 iterations each in parallel.
Because this is a CPU-bound algorithm with no blocking, no I/O, and no contention for shared resources, increasing the number of threads per core does not reflect much on the overall execution time.
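For this reason, you can size the pool to the core count directly instead of hard-coding it. A sketch along the lines of the question's code (availableProcessors() reports logical CPUs; the smaller loop bound is just to keep the example quick):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolSizing {
    public static void main(String[] args) throws Exception {
        // One thread per logical CPU: adding more would not speed up
        // a purely CPU-bound workload, as the timings above show.
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            futures.add(pool.submit(() -> {
                double a = 0;
                for (int j = 0; j < 1_000_000; j++) a += Math.tan(a);
            }));
        }
        for (Future<?> f : futures) f.get();   // wait for all tasks
        pool.shutdown();

        System.out.println("Ran 100 tasks on a pool of " + cores + " threads");
    }
}
```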
First off, the assumption that context-switch overhead increases with the number of threads is not always correct. Your sample program performs a fixed amount of work: the more threads you have, the less work each thread does and the less CPU time it receives.
Even when you have hundreds of threads, the OS will not switch between them arbitrarily often. There is usually a minimum interval (time slice) that a thread is allowed to run without preemption. With too many threads competing for a physical core, each thread receives its CPU time slice less often (i.e. starvation), but the number of context switches does not grow proportionally to the number of threads.
I measured the number of context switches in your program with Linux perf:
perf stat -e context-switches java Main
And here are the results:
2 threads | 1,445 context-switches
4 threads | 2,417 context-switches
8 threads | 9,280 context-switches
16 threads | 9,257 context-switches
32 threads | 9,527 context-switches
64 threads | 9,986 context-switches
As expected, a big jump in context switches happens once the number of threads exceeds the number of physical CPUs, but after that the number does not grow much.
OK, we see roughly 10K context switches. Is that much? As the answers suggest, the latency of a context switch can be estimated at several microseconds. Let's take 10 as an upper bound. So 10K switches together will take about 100 ms, or 25 ms per CPU. It's unlikely that your test will detect this overhead. Furthermore, all threads are purely CPU-bound: they do not even access memory enough to suffer from CPU cache pollution, and they do not access other shared resources either, so there is no indirect context-switch overhead in this case.
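That back-of-the-envelope estimate, written out (the 10 µs per switch is an assumed upper bound, not a measurement):

```java
public class SwitchOverhead {
    public static void main(String[] args) {
        long switches = 10_000;      // ~10K context switches measured by perf
        long microsPerSwitch = 10;   // assumed upper bound per switch
        int logicalCpus = 4;         // 4 logical CPUs on the test machine

        long totalMs = switches * microsPerSwitch / 1_000;  // total overhead
        long perCpuMs = totalMs / logicalCpus;              // spread across CPUs

        System.out.println(totalMs + " ms total, " + perCpuMs + " ms per CPU");
        // → 100 ms total, 25 ms per CPU — lost in the noise of an 18 s run
    }
}
```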