Why is passing by const ref slower when using std::async?

As an exercise to learn about std::async I wrote a small program that calculates the sum of a large vector of integers, with the work distributed over many threads.

My code is as follows:

#include <iostream>
#include <vector>
#include <future>
#include <chrono>

typedef unsigned long long int myint;

// Calculate sum of part of the elements in a vector
myint partialSum(const std::vector<myint>& v, int start, int end)
{
    myint sum(0);
    for(int i=start; i<=end; ++i)
    {
        sum += v[i];
    }
    return sum;
}

int main()
{
    const int nThreads = 100;
    const int sizePerThread = 100000;
    const int vectorSize = nThreads * sizePerThread;
    
    std::vector<myint> v(vectorSize);   
    std::vector<std::future<myint>> partial(nThreads);
    myint tot = 0;
    
    // Fill vector  
    for(int i=0; i<vectorSize; ++i)
    {
        v[i] = i+1;
    }
    std::chrono::steady_clock::time_point startTime = std::chrono::steady_clock::now();
    
    // Start threads
    for( int t=0; t < nThreads; ++t)
    {
        partial[t] = std::async( std::launch::async, partialSum, v, t*sizePerThread, (t+1)*sizePerThread -1);               
    }
    
    // Sum total
    for( int t=0; t < nThreads; ++t)
    {
        myint ps = partial[t].get();
        std::cout << t << ":\t" << ps << std::endl;
        tot += ps;
    }
    std::cout << "Sum:\t" << tot << std::endl;
    
    std::chrono::steady_clock::time_point endTime = std::chrono::steady_clock::now();
    std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count() <<std::endl;
}

My question concerns the call to the function partialSum, and in particular how the large vector is passed. The function is called as follows:

        partial[t] = std::async( std::launch::async, partialSum, v, t*sizePerThread, (t+1)*sizePerThread -1);       

with the definition as follows

myint partialSum(const std::vector<myint>& v, int start, int end)

With this approach the calculation is relatively slow. If I use std::ref(v) in the std::async call, my function is a lot quicker and more efficient. That much makes sense to me.
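For reference, the std::ref variant of the call (the one that runs in 0.2 seconds, see below) is:

        partial[t] = std::async( std::launch::async, partialSum, std::ref(v), t*sizePerThread, (t+1)*sizePerThread -1);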

However, if I still pass v instead of std::ref(v), but replace the function with

myint partialSum(std::vector<myint> v, int start, int end)

the program also runs a lot quicker (and uses less memory). I don't understand why the const-ref implementation is slower. How does the compiler manage this without any references in place?

With the const-ref implementation this program typically takes 6.2 seconds to run; with the by-value version, 3.0 seconds. (Note that with const ref and std::ref it runs in 0.2 seconds for me.)

I am compiling with g++ -Wall -pedantic (adding -O3 when passing just v demonstrates the same effect), using:

g++ --version

g++ (Rev1, Built by MSYS2 project) 6.3.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

asked Nov 22 '17 by Bernhard


1 Answer

The short story

Given a copy- and move-constructible type T,

V f(T);
V g(T const&);

T t;
auto v1 = std::async(std::launch::async, f, t).get();
auto v2 = std::async(std::launch::async, g, t).get();

the only relevant difference between the two async calls is that in the first, t's copy is destroyed as soon as f returns, while in the second, t's copy may be destroyed only as an effect of the get() call. If the async calls happen in a loop with the futures get() later, the first will use constant memory on average (assuming a constant per-thread workload), while the second will use linearly growing memory at worst, resulting in more cache misses and worse allocation performance.

The long story

First of all, I can reproduce the observed slowdown (consistently ~2x on my system) with both gcc and clang; moreover, the same code with equivalent std::thread invocations does not exhibit the same behavior, the const& version turning out slightly faster, as expected. Let's see why.

To begin, the specification of async reads:

[futures.async] If launch::async is set in policy, calls INVOKE(DECAY_COPY(std::forward<F>(f)), DECAY_COPY(std::forward<Args>(args))...) (23.14.3, 33.3.2.2) as if in a new thread of execution represented by a thread object with the calls to DECAY_COPY() being evaluated in the thread that called async [...] The thread object is stored in the shared state and affects the behavior of any asynchronous return objects that reference that state.

So async will copy the arguments, forwarding those copies to the callable while preserving rvalueness; in this regard it behaves the same as the std::thread constructor, and there is no difference between the two OP versions: both copy the vector.
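To see that both versions indeed pay for exactly one copy of the argument, here is a minimal sketch; CopyCounter is a made-up probe type, not part of the original program. On gcc and clang, both lines print 1:

#include <atomic>
#include <future>
#include <iostream>

// Probe type that counts how many times it is copied.
struct CopyCounter {
    static std::atomic<int> copies;
    CopyCounter() = default;
    CopyCounter(const CopyCounter&) { ++copies; }
    CopyCounter(CopyCounter&&) noexcept {}
};
std::atomic<int> CopyCounter::copies{0};

void byValue(CopyCounter) {}
void byConstRef(const CopyCounter&) {}

int main()
{
    CopyCounter c;

    CopyCounter::copies = 0;
    std::async(std::launch::async, byValue, c).get();
    std::cout << "by value:     " << CopyCounter::copies << " copy\n";

    CopyCounter::copies = 0;
    std::async(std::launch::async, byConstRef, c).get();
    std::cout << "by const ref: " << CopyCounter::copies << " copy\n";
}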

The difference lies in the last sentence of the quote: the thread object is part of the shared state and will not be freed until the latter is released (e.g. by a future::get() call).

Why is this important? Because the standard does not specify what the decayed copies are bound to; we only know that they must outlive the callable's invocation, but we don't know whether they will be destroyed immediately after the call, at thread exit, or when the thread object is destroyed (along with the shared state).

In fact, it turns out that the gcc and clang implementations store the decayed copies in the shared state of the resulting future.
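This is easy to observe directly. Below is a minimal sketch under that assumption; Probe is a made-up type, not part of the original program. With libstdc++/libc++, the "copy destroyed" message for the by-value call appears as soon as the callable returns, while for the const& call it appears only once get() is reached:

#include <chrono>
#include <future>
#include <iostream>
#include <thread>

// Probe type that reports when a decayed copy of it is destroyed.
struct Probe {
    bool is_copy = false;
    Probe() = default;
    Probe(const Probe&) : is_copy(true) {}
    Probe(Probe&& o) noexcept : is_copy(o.is_copy) { o.is_copy = false; }
    ~Probe() { if (is_copy) std::cout << "copy destroyed\n"; }
};

void byValue(Probe) { std::cout << "byValue returns\n"; }
void byConstRef(const Probe&) { std::cout << "byConstRef returns\n"; }

int main()
{
    Probe p;
    {
        auto f = std::async(std::launch::async, byValue, p);
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::cout << "-- calling get() --\n";
        f.get(); // "copy destroyed" was already printed: the copy was
                 // moved into the parameter and died when byValue returned
    }
    {
        auto f = std::async(std::launch::async, byConstRef, p);
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::cout << "-- calling get() --\n";
        f.get(); // "copy destroyed" prints only now: the copy was kept
                 // in the shared state until get() released it
    }
}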

Consequently, in the const& version the vector copy is stored in the shared state and destroyed only at future::get(): this results in the "Start threads" loop allocating a new vector at each step, with memory growing linearly.

Conversely, in the by-value version the vector copy is moved into the callable's argument and destroyed as soon as the callable returns; at future::get(), only a moved-from, empty vector is destroyed. So, if the callable is fast enough to destroy the vector before a new one is created, the same memory gets allocated over and over and memory usage stays almost constant. This results in fewer cache misses and faster allocations, explaining the improved timings.
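The practical takeaway matches what the question already measured: to avoid any copy at all, pass the vector through a reference wrapper, and make sure v outlives every get() call. A sketch of the question's "Start threads" loop using std::cref (which also matches the const& signature) would be:

// Only a reference_wrapper is decay-copied here, not the vector;
// v must stay alive until every partial[t].get() has returned.
for( int t=0; t < nThreads; ++t)
{
    partial[t] = std::async( std::launch::async, partialSum, std::cref(v), t*sizePerThread, (t+1)*sizePerThread -1);
}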

answered Nov 02 '22 by Massimiliano Janes