When compiling the following code:
#include <iostream>
#include <vector>
#include <thread>
#include <chrono>
#include <mutex>

std::mutex cout_mut;

void task()
{
    for (int i = 0; i < 10; i++)
    {
        double d = 0.0;
        for (size_t cnt = 0; cnt < 200000000; cnt++) d += 1.23456;
        std::lock_guard<std::mutex> lg(cout_mut);
        std::cout << d << "(Help)" << std::endl;
        // std::cout << "(Help)" << d << std::endl;
    }
}

int main()
{
    std::vector<std::thread> all_t(std::thread::hardware_concurrency());
    auto t_begin = std::chrono::high_resolution_clock::now();
    for (auto& t : all_t) t = std::thread{task};
    for (auto& t : all_t) t.join();
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << "Took : " << (t_end - t_begin).count() << std::endl;
}
Under MinGW 4.8.1 it takes roughly 2.5 seconds to execute on my box, which is approximately the time it takes to run the task function in a single thread.
However, when I uncomment the middle line and comment out the line before it (that is, when I swap the order in which d and "(Help)" are written to std::cout), the whole thing now takes 8-9 seconds.
What is the explanation?
I tested again and found that the problem only occurs with MinGW-build x32-4.8.1-win32-dwarf-rev3, but not with MinGW-build x64-4.8.1-posix-seh-rev3. I have a 64-bit machine. With the 64-bit compiler both versions take about three seconds. With the 32-bit compiler, however, the problem remains (and it is not due to a release/debug configuration mix-up).
In fact, multithreading can be slower than a single-threaded version because of the overhead of creating the threads and of context switching between them. A multithreaded program can also perform worse when the threads spend their time waiting on a shared mutex instead of running in parallel.
Since each core only does one thing at a time, you don't necessarily gain any speed from multithreading, but it can be handy for keeping your program responsive. Say your game needs to load a large resource file into memory: a background thread can do that while the main thread keeps handling input.
Too many software threads also tend to evict each other's data from the CPU caches, and that cache fighting can hurt performance. A similar overhead, at a different level, is thrashing virtual memory: most computers use virtual memory, which resides on disk, with only the frequently used portions kept in real memory.
It has nothing to do with multi-threading. It is a problem of loop optimization. I have rearranged the original code to get something minimalistic demonstrating the issue:
#include <iostream>
#include <chrono>
#include <mutex>

int main()
{
    auto t_begin = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 2; i++)
    {
        double d = 0.0;
        for (int j = 0; j < 100000; j++) d += 1.23456;
        std::mutex mutex;
        std::lock_guard<std::mutex> lock(mutex);
#ifdef SLOW
        std::cout << 'a' << d << std::endl;
#else
        std::cout << d << 'a' << std::endl;
#endif
    }
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << "Took : " << (static_cast<double>((t_end - t_begin).count())/1000.0) << std::endl;
}
When compiled and executed with:
g++ -std=c++11 -DSLOW -o slow -O3 b.cpp -lpthread ; g++ -std=c++11 -o fast -O3 b.cpp -lpthread ; ./slow ; ./fast
The output is:
a123456
a123456
Took : 931
123456a
123456a
Took : 373
Most of the difference in timing is explained by the assembly code generated for the inner loop: in the fast case the accumulation happens directly in xmm0, while in the slow case it goes through xmm1, leading to two extra movsd instructions per iteration.
Now, when compiled with the '-ftree-loop-linear' option:
g++ -std=c++11 -ftree-loop-linear -DSLOW -o slow -O3 b.cpp -lpthread ; g++ -std=c++11 -ftree-loop-linear -o fast -O3 b.cpp -lpthread ; ./slow ; ./fast
The output becomes:
a123456
a123456
Took : 340
123456a
123456a
Took : 346