Why are 50 threads faster than 4?

    DWORD WINAPI MyThreadFunction(LPVOID lpParam) {
        volatile auto x = 1;
        for (auto i = 0; i < 800000000 / MAX_THREADS; ++i) {
            x += i / 3;
        }
        return 0;
    }

This function is run in MAX_THREADS threads.
I have run the tests on Intel Core 2 Duo, Windows 7, MS Visual Studio 2012 using Concurrency Visualizer with MAX_THREADS=4 and MAX_THREADS=50.
test1 (4 threads) completed in 7.1 seconds, but test2 (50 threads) completed in 5.8 seconds while test1 has more context switches than test2.
I have run the same tests on Intel Core i5, Mac OS 10.7.5 and got the same results.

asked Apr 28 '13 22:04

dizel3d

2 Answers

I decided to benchmark this myself on my 4-core machine. I directly compared 4 threads with 50 threads by interleaving 100 tests of each. I used my own numbers so that I had a reasonable execution time for each task.
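The interleaving described above could be sketched roughly as follows. This is not the harness used for the plots: it is a minimal portable version using `std::thread`, and the names `burn`, `time_run`, `Samples`, and `interleaved` are hypothetical, as is the choice of workload size.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Busy work modelled on the question's loop; `volatile` keeps the
// compiler from optimising the arithmetic away.
void burn(std::int64_t iterations) {
    volatile std::int64_t x = 1;
    for (std::int64_t i = 0; i < iterations; ++i) {
        x = x + i / 3;
    }
}

// Split `total_iterations` evenly over `thread_count` threads and return
// the wall-clock seconds until the last thread has finished.
double time_run(std::int64_t total_iterations, int thread_count) {
    assert(thread_count > 0);
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    workers.reserve(thread_count);
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back(burn, total_iterations / thread_count);
    }
    for (auto& w : workers) {
        w.join();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Timings for the two configurations being compared.
struct Samples {
    std::vector<double> a;
    std::vector<double> b;
};

// Alternate between the two thread counts so that background activity
// affects both sample sets roughly equally.
Samples interleaved(std::int64_t total_iterations, int threads_a,
                    int threads_b, int rounds) {
    Samples s;
    for (int r = 0; r < rounds; ++r) {
        s.a.push_back(time_run(total_iterations, threads_a));
        s.b.push_back(time_run(total_iterations, threads_b));
    }
    return s;
}
```

Calling `interleaved(800000000, 4, 50, 100)` would mirror the 4-versus-50 comparison with 100 samples of each; the iteration count is taken from the question, not from my own runs.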

The result was as you described. The 50-thread version is marginally faster. Here is a box plot of my results:

[Figure: Parallel task comparison graph, https://i.stack.imgur.com/MNuS7.png]

Why? I think this comes down to thread scheduling. The task is not complete until all threads have done their work, and with 4 threads each one must do a quarter of the job. Because the cores are shared with other processes on the system, if any single thread is switched out in favour of another process, the entire task is delayed. While we wait for that last thread to finish, all the other cores sit idle. Note how the time distribution of the 4-thread test is much wider than that of the 50-thread test, as we might expect.

When you use 50 threads, each thread has less to do. Because of this, any delays in a single thread will have a less significant effect on the total time. When the scheduler is busy rationing cores out to lots of short threads, a delay on one core can be compensated by giving these threads time on another core. The total effect of latency on one core is not as much of a show-stopper.
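To put rough numbers on this argument, here is a toy model. It is an assumption-laden sketch, not how the Windows scheduler actually works: one core is assumed to be taken by another process for a fixed delay at the start, the work is cut into equal pieces, and each piece goes to whichever core frees up first. The function name `makespan` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Toy model, not a real scheduler: `cores` identical cores, one of which is
// busy with another process for the first `delay` seconds. `tasks` equal
// pieces of `total_work` CPU-seconds are handed, one at a time, to whichever
// core becomes free first. Returns the time at which the last piece finishes.
double makespan(int cores, int tasks, double total_work, double delay) {
    assert(cores > 0 && tasks > 0);
    // Min-heap of the times at which each core next becomes free.
    std::priority_queue<double, std::vector<double>, std::greater<double>> free_at;
    free_at.push(delay);              // the core that is interrupted
    for (int c = 1; c < cores; ++c) {
        free_at.push(0.0);            // the undisturbed cores
    }
    const double piece = total_work / tasks;
    double finish = 0.0;
    for (int t = 0; t < tasks; ++t) {
        const double start = free_at.top();
        free_at.pop();
        free_at.push(start + piece);
        finish = std::max(finish, start + piece);
    }
    return finish;
}
```

With 8 CPU-seconds of work on 4 cores and a 1-second delay on one core, `makespan(4, 4, 8.0, 1.0)` gives 3.0 seconds against an undisturbed 2.0, whereas `makespan(4, 50, 8.0, 1.0)` comes out at roughly 2.28 seconds, close to the 2.25-second ideal in which the lost second is spread evenly over all four cores.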

So it would seem that in this case the extra context-switching is not the biggest factor. While the gain is small, it appears to be beneficial to swamp the thread scheduler a little bit, given that the processing is much more significant than the context-switching. As with everything, you must find the correct balance for your application.

[edit] Out of curiosity I ran a test overnight while my computer wasn't doing much else. This time I used 200 samples per test. Again, tests were interleaved to reduce the impact of any localised background tasks.

The first plot of these results is for low thread-counts (up to 3 times the number of cores). You can see that some choices of thread count are quite poor: anything that is not a multiple of the number of cores, and especially odd values.

[Figure: Additional test plot - low thread count, https://i.stack.imgur.com/FAjsm.png]

The second plot is for higher thread-counts (from 3 times the number of cores up to 60).

[Figure: Additional test plot - high thread count, https://i.stack.imgur.com/qApk6.png]

Above, you can see a definite downward trend as the thread-count increases. You can also see the spread of results narrow as the thread-count increases.

In this test, it's interesting to note that the performance of the 4-thread and 50-thread tests was about the same, and the spread of results in the 4-thread test was not as wide as in my original test. Because the computer wasn't doing much else, it could dedicate its time to the tests. It would be interesting to repeat the test while placing one core under 75% load.

And just to keep things in perspective, consider this:

[Figure: Scaling threads, https://i.stack.imgur.com/Ml4gX.png]

[Another edit] After posting my last lot of results, I noticed that the jumbled box plot showed a trend for those tests that were multiples of 4, but the data was a little hard to see.

I decided to do a test with only multiples of four, and thought I may as well find the point of diminishing returns at the same time. So I used thread counts that are powers of 2, up to 1024. I would have gone higher, but Windows bugged out at around 1400 threads.

The result is rather nice, I think. In case you wonder what the little circles are, those are the median values. I chose them instead of the red line that I used previously because they show the trend more clearly.
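For anyone reproducing the plot, the median of a set of timings can be computed along these lines. This is a small sketch; the function name `median` and the choice to average the two middle values for an even sample count are my assumptions, not something the plotting tool necessarily does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a set of timing samples. The vector is taken by value so the
// caller's data is left in its original order.
double median(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    // Odd count: the middle element. Even count: mean of the two middle ones.
    return n % 2 == 1 ? samples[n / 2]
                      : (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
}
```

The median is less sensitive than the mean to the occasional very slow run, which is why it suits timings taken on a machine that is also doing other things.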

[Figure: Trend for exponentiating the thread-count, https://i.stack.imgur.com/JRvvG.png]

It seems that in this particular case, the pay dirt lies somewhere between 50 and 150 threads. After that, the benefit quickly drops away, and we're entering the territory of excessive thread management and context-switching.

The results might vary significantly with a longer or shorter task. In this case, it was a task involving a lot of pointless arithmetic which took approximately 18 seconds to compute on a single core.

By tuning only the number of threads, I was able to shave an extra 1.5% to 2% off the median execution time of the 4-thread version.

answered Sep 21 '22 04:09

paddy

It all depends on what your threads are doing.

Your computer can only run as many threads at the same instant as there are cores in the system. This includes virtual cores provided by features like Hyper-Threading.

CPU-bound

If your threads are CPU-bound (meaning they spend the vast majority of their time doing calculations on data that is in memory), you will see little improvement from increasing the number of threads above the number of cores. You actually lose efficiency with more threads running, because of the added overhead of context-switching the threads on and off the CPU cores.
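In standard C++, a starting point for the thread count of CPU-bound work can be obtained as sketched below. The helper name `cpu_bound_thread_count` and the fallback of 4 are hypothetical; `std::thread::hardware_concurrency()` is only a hint and is allowed to return 0 when the value cannot be determined.

```cpp
#include <thread>

// Number of worker threads to start for CPU-bound work: one per hardware
// thread (logical core). hardware_concurrency() may return 0 when the value
// is unknown, so fall back to a caller-supplied guess in that case.
unsigned cpu_bound_thread_count(unsigned fallback = 4) {
    const unsigned n = std::thread::hardware_concurrency();
    return n != 0 ? n : fallback;
}
```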

I/O-bound

Where having more threads than cores (#threads > #cores) will help is when your threads are I/O-bound, meaning they spend most of their time waiting on I/O (hard disk, network, other hardware, etc.). In this case, a thread that is blocked waiting on I/O to complete will be pulled off the CPU, and a thread that is actually ready to do something will be put on instead.
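A rough way to see this effect is to stand in for I/O with a sleep, as in the sketch below. The sleep is an assumption used purely for illustration (real I/O waits vary), and the name `time_blocking_threads` is hypothetical.

```cpp
#include <chrono>
#include <thread>
#include <vector>

// Start `thread_count` threads that each "wait on I/O" (simulated here with
// a sleep) and return the wall-clock seconds until all of them are done.
double time_blocking_threads(int thread_count, std::chrono::milliseconds wait) {
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        // A sleeping thread is off the CPU, just like one blocked on I/O.
        workers.emplace_back([wait] { std::this_thread::sleep_for(wait); });
    }
    for (auto& w : workers) {
        w.join();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Because blocked threads do not occupy a core, eight threads that each wait 50 ms should finish in close to 50 ms overall rather than 400 ms, regardless of how many cores the machine has.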

The way to get highest efficiency is to always keep the CPU busy with a thread that's actually doing something. (Not waiting on something, and not context-switching to other threads.)

answered Sep 22 '22 04:09

Jonathon Reinhart
