
Why will for-loop with multithreading not have as great performance as with single-thread?

I believed it was better to process simple, heavy workloads (e.g. matrix calculations) with multiple threads than with a single thread, so I tested the following code:

#include <chrono>
#include <future>
#include <iostream>
#include <random>
#include <vector>

int main()
{
    constexpr int N = 100000;

    std::random_device rd;
    std::mt19937 mt(rd());
    std::uniform_real_distribution<double> ini(0.0, 10.0);

    // single-thread
    {
        std::vector<int> vec(N);
        for(int i = 0; i < N; ++i)
        {
            vec[i] = static_cast<int>(ini(mt));
        }

        auto start = std::chrono::system_clock::now();

        for(int i = 0; i < N; ++i)
        {
            vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
        }

        auto end = std::chrono::system_clock::now();
        auto dur = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::cout << "single : " << dur << " ms."<< std::endl;
    }

    // multi-threading (Th is the number of threads)
    for(int Th : {1, 2, 4, 8, 16})
    {
        std::vector<int> vec(N);
        for(int i = 0; i < N; ++i)
        {
            vec[i] = static_cast<int>(ini(mt));
        }

        auto start = std::chrono::system_clock::now();

        std::vector<std::future<void>> fut(Th);
        for(int t = 0; t < Th; ++t)
        {
            fut[t] = std::async(std::launch::async, [t, &vec, &N, &Th]{
                for(int i = t*N / Th; i < (t + 1)*N / Th; ++i)
                {
                    vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
                }
            });
        }
        for(int t = 0; t < Th; ++t)
        {
            fut[t].get();
        }

        auto end = std::chrono::system_clock::now();
        auto dur = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::cout << "Th = " << Th << " : " << dur << " ms." << std::endl;
    }

    return 0;
}

The execution environment :

OS : Windows 10 64-bit
Build-system : Visual Studio Community 2015
CPU : Core i5 4210U

When building this program in Debug mode, the result was as I expected:

single : 146 ms.
Th = 1 : 140 ms.
Th = 2 : 71 ms.
Th = 4 : 64 ms.
Th = 8 : 61 ms.
Th = 16 : 68 ms.

This shows that the code not using std::async performs about the same as the version using one thread, and that with 4 or 8 threads I get a good speedup.

However, in Release mode I got a different result (with N increased from 100000 to 100000000):

single : 54 ms.
Th = 1 : 443 ms.
Th = 2 : 285 ms.
Th = 4 : 205 ms.
Th = 8 : 206 ms.
Th = 16 : 221 ms.

This result puzzles me. Comparing only the multi-threaded runs among themselves, more threads do perform better; but the fastest version overall is the first one, which does not use std::async at all. I know that optimization and the overhead of multithreading have a large effect on performance. However,

  • The work is just arithmetic over a vector, so what can be optimized in the single-threaded code that cannot be optimized in the multi-threaded code?
  • This program uses no mutexes, atomics, etc., and no data races can occur, so I would expect the overhead of multithreading to be relatively small.
  • CPU utilization is lower in the code not using std::async than in the multi-threaded code. Isn't it more efficient to use a larger share of the CPU?

Update : I investigated vectorization. I enabled the /Qvec-report:1 option and found the following:

//vectorized (when N is large)
for(int i = 0; i < N; ++i)
{
    vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
}

//not vectorized
auto lambda = [&vec, &N]{
    for(int i = 0; i < N; ++i)
    {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
};
lambda();

//not vectorized
std::vector<std::future<void>> fut(Th);
for(int t = 0; t < Th; ++t)
{
    fut[t] = std::async(std::launch::async, [t, &vec, &N, Th]{
        for(int i = t*N / Th; i < (t + 1)*N / Th; ++i)
        {
            vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
        }
    });
}

and run time :

single (with vectorization) : 47 ms.
single (without vectorization)  : 70 ms.

So the for-loop is indeed not vectorized in the multi-threaded version. However, vectorization alone cannot explain the gap; the multi-threaded version must also be slow for some other reason.


Update 2 : I rewrote the for-loop in the lambda (from Type A to Type B):

//Type A (the previous one)
fut[t] = std::async(std::launch::async, [t, &vec, &N, Th]{
    for(int i = t*N / Th; i < (t + 1)*N / Th; ++i)
    {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
});

//Type B (the new one)
fut[t] = std::async(std::launch::async, [t, &vec, &N, Th]{
    int nb = t * N / Th;
    int ne = (t + 1) * N / Th;
    for(int i = nb; i < ne; ++i)
    {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
});

Type B worked well. The result :

single (vectorized) : 44 ms.
single (unvectorized) : 77 ms.
--
Th = 1 (Type A) : 435 ms.
Th = 2 (Type A) : 278 ms.
Th = 4 (Type A) : 219 ms.
Th = 8 (Type A) : 212 ms.
--
Th = 1 (Type B) : 112 ms.
Th = 2 (Type B) : 74 ms.
Th = 4 (Type B) : 60 ms.
Th = 8 (Type B) : 61 ms.

The result of Type B is understandable (the multi-threaded code runs faster than the unvectorized single-threaded code, though not as fast as the vectorized one). On the other hand, Type A looks equivalent to Type B (it merely skips the temporary variables), yet the two show very different performance. The two types must generate different assembly code.


Update 3 : I may have found a factor that slows down the multi-threaded for-loop: the division in the for-loop condition. This is a single-threaded test:

//ver 1 (ordinary)
fut[t] = std::async(std::launch::async, [&vec, &N]{
    for(int i = 0; i < N; ++i)
    {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
});

//ver 2 (introducing a dummy variable Q)
int Q = 1;
fut[t] = std::async(std::launch::async, [&vec, &N, Q]{
    for(int i = 0; i < N / Q; ++i)
    {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
});

//ver 3 (using a temporary variable)
int Q = 1;
fut[t] = std::async(std::launch::async, [&vec, &N, Q]{
    int end = N / Q;
    for(int i = 0; i < end; ++i)
    {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
});

//ver 4 (using a raw value)
fut[t] = std::async(std::launch::async, [&vec]{
    for(int i = 0; i < 100000000; ++i)
    {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
});

And the running times :

ver 1 : 132 ms.
ver 2 : 391 ms.
ver 3 : 47 ms.
ver 4 : 43 ms.

ver 3 & 4 were well optimized, and ver 1 was not as much, I think because the compiler could not treat N as invariant even though N is constexpr. I think ver 2 was very slow for the same reason: the compiler could not see that N and Q never change, so the condition i < N / Q required costly extra instructions (a division) on every iteration, which slowed down the for-loop.

asked Feb 27 '16 by m. bs



1 Answer

When you run single-threaded, your one thread has vec in the caches, as you've just created it from mt, and it keeps streaming nicely through the caches as the only user of all cache levels.
I don't think much vectorization is going on here, or you'd get shorter times. I could be wrong, though, as memory bandwidth is the key here. Did you look at the asm?

  1. Any other thread would have to fetch RAM. This in itself is not a big issue in your case, as it's a single CPU, so L3 is shared, and the data set is larger than L3 anyway.
    BUT, multiple threads fighting for L3 is bad. I think this is the main factor here.

  2. You run too many threads. You should run only as many threads as you have cores, to pay less for context switching and cache pollution.
    HT is beneficial when the 2 hardware threads have enough "holes" in the pipelines (not the case here), in branch prediction (not the case here), or in cache utilization (a strong case here -> see #1).
    I'm actually surprised >2 threads didn't degrade much more --- nowadays CPUs are amazing!

  3. Thread launch and termination times are not very predictable. If you want more predictability, keep the threads running all the time and use some cheap signalling to start them and to be notified when they're done.

EDIT: Answers to specific questions

The work is just arithmetic over a vector, so what can be optimized in the single-threaded code that cannot be optimized in the multi-threaded code?

Not much code here to optimize... You can break the long loop into blocks to enable loop unrolling:

const int C = 16; // try other C values?
for(int i = nb; i < ne; i += C) {
  for(int j = 0; j < C; ++j)
    vec[i + j] = 2 * vec[i + j] + 3 * vec[i + j] - vec[i + j]; // === vec[i + j] *= 4
}
// need to handle the remainder when (ne - nb) % C != 0

You can vectorize by hand if the compiler doesn't. Look at the assembly first.

This program uses no mutexes, atomics, etc., and no data races can occur, so I would expect the overhead of multithreading to be relatively small.

True. Except that threads may start in their own time, especially on Windows, and especially when there are many of them.

CPU utilization is lower in the code not using std::async than in the multi-threaded code. Isn't it more efficient to use a larger share of the CPU?

You always want to use a larger CPU % for a shorter time. I'm not sure what you're seeing, as there's no IO here.

answered Oct 21 '22 by BitWhistler