 

C++11 thread vs async performance (VS2013)

I feel like I'm missing something here...

I slightly altered some code to change from using std::thread to std::async and noticed a substantial performance increase. I wrote up a simple test which I assumed would run nearly identically with std::thread and with std::async.

std::atomic<int> someCount{0};
const int THREADS = 200;
std::vector<std::thread> threadVec(THREADS);
std::vector<std::future<void>> futureVec(THREADS);
auto lam = [&]()
{
    for (int i = 0; i < 100; ++i)
        someCount++;
};

for (int i = 0; i < THREADS; ++i)
    threadVec[i] = std::thread(lam);
for (int i = 0; i < THREADS; ++i)
    threadVec[i].join();

for (int i = 0; i < THREADS; ++i)
    futureVec[i] = std::async(std::launch::async, lam);
for (int i = 0; i < THREADS; ++i)
    futureVec[i].get();

I didn't get too deep into analysis, but some preliminary results made it seem that the std::async code ran around 10X faster! Results varied slightly with optimizations off, and I also tried switching the execution order.

Is this some Visual Studio compiler issue? Or is there some deeper implementation issue I'm overlooking that would account for this performance difference? I thought that std::async was a wrapper around the std::thread calls?


Also, considering these differences, what would be the way to get the best performance here? (std::thread and std::async aren't the only ways to create threads.)

What about if I wanted detached threads? (std::async can't do that as far as I'm aware)

asked Nov 04 '14 by Ace24713

2 Answers

When you use std::async you are not necessarily creating new threads; instead, the implementation can reuse threads from a pool. Creating and destroying a thread is a very expensive operation, requiring about 200,000 CPU cycles on Windows. On top of that, when the number of threads greatly exceeds the number of CPU cores, the operating system has to spend extra time creating them and scheduling their access to the available CPU time on each core.

UPDATE: To show that the number of threads used by std::async is a lot smaller than with std::thread, I modified the testing code to count the number of unique thread ids used by each approach, as below. On my PC this gives:

Number of threads used running std::threads = 200
Number of threads used to run std::async = 4

The number of threads running std::async varies between 2 and 4 on my PC. It basically means that std::async reuses threads instead of creating new ones every time. Curiously, if I increase the computing time of the lambda by replacing 100 with 1,000,000 iterations in the for loop, the number of async threads increases to 9, while the raw-thread version always uses 200. It's worth keeping in mind that "Once a thread has finished, the value of std::thread::id may be reused by another thread".

Here is the testing code:

#include <atomic>
#include <vector>
#include <future>
#include <thread>
#include <mutex>
#include <unordered_set>
#include <iostream>

int main()
{
    std::atomic<int> someCount{0};
    const int THREADS = 200;
    std::vector<std::thread> threadVec(THREADS);
    std::vector<std::future<void>> futureVec(THREADS);

    std::unordered_set<std::thread::id> uniqueThreadIdsAsync;
    std::unordered_set<std::thread::id> uniqueThreadsIdsThreads;
    std::mutex mutex;

    auto lam = [&](bool isAsync)
    {
        for (int i = 0; i < 100; ++i)
            someCount++;

        auto threadId = std::this_thread::get_id();
        if (isAsync)
        {
            std::lock_guard<std::mutex> lg(mutex);
            uniqueThreadIdsAsync.insert(threadId);
        }
        else
        {
            std::lock_guard<std::mutex> lg(mutex);
            uniqueThreadsIdsThreads.insert(threadId);
        }
    };

    for (int i = 0; i < THREADS; ++i)
        threadVec[i] = std::thread(lam, false); 

    for (int i = 0; i < THREADS; ++i)
        threadVec[i].join();
    std::cout << "Number of threads used running std::threads = " << uniqueThreadsIdsThreads.size() << std::endl;

    for (int i = 0; i < THREADS; ++i)
        futureVec[i] = std::async(std::launch::async, lam, true);
    for (int i = 0; i < THREADS; ++i)
        futureVec[i].get();
    std::cout << "Number of threads used to run std::async = " << uniqueThreadIdsAsync.size() << std::endl;
}
answered Oct 10 '22 by Darien Pardinas


Since all your threads update the same atomic<int> someCount, the performance degradation could just as well be caused by contention (the atomic ensures that all concurrent accesses are sequentially ordered). The consequences:

  • the threads spend their time waiting,
  • but they consume CPU cycles anyway,
  • so your system's throughput is wasted.

With async() it would then be sufficient for some variation in scheduling to occur to significantly reduce contention and increase throughput. For example, the standard says that a launch::async function object is executed "as if in a new thread of execution represented by a thread object ...". It doesn't say that it must be a dedicated thread (so it can be, but doesn't have to be, a thread pool). Another hypothesis is that the implementation uses more relaxed scheduling, because nothing says the thread needs to execute immediately (the constraint is only that it has executed before get() returns).

Recommendation

Benchmarks should be designed with separation of concerns in mind. So for measuring multithreading performance, inter-thread synchronisation should be avoided as much as possible.

Keep in mind that if you have more than thread::hardware_concurrency() threads active, there's no true concurrency any more and the OS has to pay the overhead of context switching.

Edit: Some experimental feedback (2)

With a lam loop of 100 iterations, the benchmark results I measure are not usable because of the magnitude of error caused by the Windows clock resolution of about 15 ms.

Test case            Thread      Async
10 000 loop              78         31
1 000 000 loop         2743       2670    (the longer the work, the smaller the difference)
10 000 + yield()        500       1296    (many more context switches)

When increasing the number of THREADS, the timings grow proportionally, but only for the test cases with short work. This suggests that the observed difference is in fact related to an overhead at thread creation rather than to poor execution of the threads.

In a second experiment, I added code to count the number of threads really involved, based on a vector storing this_thread::get_id() for each execution:

  • For the thread version, no surprise: 200 are always created (here).
  • Very interestingly, the async() version uses between 8 and 15 threads for the shorter work, but an increasing number of threads (up to 131 in my tests) as the work becomes longer.

This suggests that async does not use a traditional thread pool (i.e. one with a limited number of threads) but rather reuses threads that have already finished their work. This of course reduces overhead, especially for smaller tasks. (I have updated my initial answer accordingly.)

answered Oct 10 '22 by Christophe