Why is std::async slow compared to simple detached threads?

Tags: c++, asynchronous, multithreading, c++11, stdasync

I've been told several times that I should use std::async with the std::launch::async policy for fire-and-forget tasks (so that it does its magic on a new thread of execution, preferably).

Encouraged by these statements, I wanted to see how std::async compares to:

sequential execution
a simple detached std::thread
my simple async "implementation"

My naive async implementation looks like this:

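#include <functional> // std::bind
#include <future>     // std::future, std::packaged_task
#include <thread>     // std::thread
#include <utility>    // std::forward, std::move

template <typename F, typename... Args>
auto myAsync(F&& f, Args&&... args) -> std::future<decltype(f(args...))>
{
    // Bind the callable to its arguments so the packaged task takes no parameters.
    std::packaged_task<decltype(f(args...))()> task(std::bind(std::forward<F>(f), std::forward<Args>(args)...));
    auto future = task.get_future();

    // Run the task on its own thread and detach it; only the future remains.
    std::thread thread(std::move(task));
    thread.detach();

    return future;
}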

Nothing fancy here: it packs the functor f into a std::packaged_task along with its arguments, launches it on a new std::thread which is then detached, and returns the std::future obtained from the task.

And now the code measuring execution time with std::chrono::high_resolution_clock:

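#include <chrono>
#include <cstdlib>  // EXIT_SUCCESS
#include <future>
#include <iostream>
#include <thread>

void someTask(); // defined below

int main()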
{
    constexpr unsigned short TIMES = 1000;

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < TIMES; ++i)
    {
        someTask();
    }
    auto dur = std::chrono::high_resolution_clock::now() - start;

    auto tstart = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < TIMES; ++i)
    {
        std::thread t(someTask);
        t.detach();
    }
    auto tdur = std::chrono::high_resolution_clock::now() - tstart;

    std::future<void> f;
    auto astart = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < TIMES; ++i)
    {
        f = std::async(std::launch::async, someTask);
    }
    auto adur = std::chrono::high_resolution_clock::now() - astart;

    auto mastart = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < TIMES; ++i)
    {
        f = myAsync(someTask);
    }
    auto madur = std::chrono::high_resolution_clock::now() - mastart;

    std::cout << "Simple: " << std::chrono::duration_cast<std::chrono::microseconds>(dur).count() << std::endl
              << "Threaded: " << std::chrono::duration_cast<std::chrono::microseconds>(tdur).count() << std::endl
              << "std::async: " << std::chrono::duration_cast<std::chrono::microseconds>(adur).count() << std::endl
              << "My async: " << std::chrono::duration_cast<std::chrono::microseconds>(madur).count() << std::endl;

    return EXIT_SUCCESS;
}

Here someTask() is a simple function that sleeps briefly to simulate some work being done:

void someTask()
{
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

Finally, my results (in microseconds):

Sequential: 1263615
Threaded: 47111
std::async: 821441
My async: 30784

Could anyone explain these results? It seems that std::async is much slower than my naive implementation, or than plain detached std::threads. Why is that? Given these results, is there any reason to use std::async?

(Note that I ran this benchmark with both clang++ and g++, and the results were very similar.)

UPDATE:

After reading Dave S's answer I updated my little benchmark as follows:

std::future<void> f[TIMES];
auto astart = std::chrono::high_resolution_clock::now();
for (int i = 0; i < TIMES; ++i)
{
    f[i] = std::async(std::launch::async, someTask);
}
auto adur = std::chrono::high_resolution_clock::now() - astart;

So the std::futures are no longer destroyed (and thus joined) on every iteration. After this change, std::async produces results similar to my implementation and to detached std::threads.

asked May 21 '16 12:05

krispet

2 Answers

One key difference is that the future returned by async joins the thread when the future is destroyed, or in your case, replaced with a new value.

This means each iteration has to finish executing someTask() and join the thread, both of which take time. None of your other tests do that; they simply spawn the threads independently.
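To see the effect in isolation, here is a minimal sketch of the question's std::async loop. The names slowTask and launchReusingOneFuture are made up for this illustration, and the 50 ms sleep is an arbitrary choice that makes the blocking easy to measure.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for someTask(); the longer sleep makes the effect obvious.
void slowTask()
{
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
}

// Launches `count` tasks while overwriting a single future, as the question's
// loop does, and returns how long the launch loop took in milliseconds.
long long launchReusingOneFuture(int count)
{
    assert(count > 0);
    using clock = std::chrono::steady_clock;

    std::future<void> f;
    const auto start = clock::now();
    for (int i = 0; i < count; ++i)
    {
        // The new task is launched first; the assignment then releases the
        // previous shared state, which blocks until the previous task is done.
        f = std::async(std::launch::async, slowTask);
    }
    const auto elapsed = clock::now() - start;

    f.wait(); // the last task is still running at this point
    return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}
```

With count tasks of 50 ms each, the loop should take roughly (count - 1) * 50 ms, because every iteration after the first waits for the task launched by the previous one.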

answered Oct 16 '22 14:10

Dave S

std::async returns a special std::future. This future has a ~future that does a .wait().

So your examples are fundamentally different. The slow ones actually do the tasks during your timing. The fast ones just queue up the tasks and give up any way of ever knowing when a task is done. As the behaviour of programs that let threads last past the end of main is unpredictable, one should avoid it.
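The difference between the two kinds of future can be checked directly. This is only a sketch: both function names are invented for the example, the 100 ms sleep is arbitrary, and the worker thread is joined rather than detached so that nothing outlives the function.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>
#include <thread>
#include <utility>

// A future obtained from std::async waits in its destructor.
// Returns true if the task had finished by the time the future was gone.
bool asyncFutureDestructorWaited()
{
    std::atomic<bool> finished{false};
    {
        std::future<void> fut = std::async(std::launch::async, [&finished] {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            finished = true;
        });
    } // ~future blocks here until the task has completed
    return finished.load();
}

// A future obtained from std::packaged_task does not wait in its destructor.
// Returns true if the future was destroyed while the task was still running.
bool packagedTaskFutureReturnedEarly()
{
    std::atomic<bool> finished{false};
    std::packaged_task<void()> task([&finished] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        finished = true;
    });

    std::thread worker;
    {
        std::future<void> fut = task.get_future();
        worker = std::thread(std::move(task));
    } // ~future returns immediately; the worker keeps running

    const bool finishedAtDestruction = finished.load();
    assert(worker.joinable());
    worker.join(); // joined (not detached) so `finished` stays alive long enough
    return !finishedAtDestruction;
}
```

The second check depends on timing (the future is destroyed well within the 100 ms sleep), so treat it as a demonstration rather than a guarantee.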

The right way to compare the tasks is to store the resulting future when generating them, and before the timer ends either .wait()/.join() them all, or avoid destroying the objects until after the timer expires. This last case, however, makes the sequential version look worse than it is.
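A fairer measurement for the std::async case might look like the following sketch, which keeps every future alive and waits for all of them before the timer stops. The names briefTask and timeAsyncIncludingCompletion are assumptions made for this example; briefTask mirrors the question's someTask().

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// Hypothetical equivalent of the question's someTask().
void briefTask()
{
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

// Returns the time in microseconds needed to launch `count` tasks
// AND for all of them to finish.
long long timeAsyncIncludingCompletion(std::size_t count)
{
    assert(count > 0);
    using clock = std::chrono::steady_clock;

    std::vector<std::future<void>> futures;
    futures.reserve(count); // avoid reallocations inside the timed region

    const auto start = clock::now();
    for (std::size_t i = 0; i < count; ++i)
    {
        // Each future is stored, so nothing is destroyed (and waited on) here.
        futures.push_back(std::async(std::launch::async, briefTask));
    }
    for (auto& f : futures)
    {
        f.wait(); // completion is part of the measurement
    }
    const auto elapsed = clock::now() - start;

    return std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
}
```

The same pattern applies to the other variants: keep the std::thread objects and join them, or keep the futures returned by myAsync and wait on them, before reading the clock.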

You do need to join/wait before starting the next test, as otherwise you are stealing resources from their timing.

Note that moving a future removes the wait from the moved-from source.
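A small sketch of that last point, with a made-up function name: after a move, the source no longer refers to the shared state, so destroying it waits for nothing, and the target becomes the object that waits.

```cpp
#include <cassert>
#include <future>
#include <utility>

// Returns true if, after a move, the source is empty and the target owns the state.
bool movedFromFutureIsEmpty()
{
    std::future<void> source = std::async(std::launch::async, [] {});
    std::future<void> target = std::move(source);

    const bool sourceEmpty = !source.valid(); // the source has nothing left to wait on
    const bool targetOwns = target.valid();   // the target now does the waiting

    assert(target.valid());
    target.wait();
    return sourceEmpty && targetOwns;
}
```

This is why pushing the result of std::async into a container, as in the question's update, does not block: the temporary is moved from before it is destroyed.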

answered Oct 16 '22 15:10

Yakk - Adam Nevraumont
