Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Console output order slows down multi-threaded program

When compiling the following code

#include <iostream>
#include <vector>
#include <thread>
#include <chrono>
#include <mutex>

std::mutex cout_mut;

void task()
{
    for(int i=0; i<10; i++)
    {
        double d=0.0;
        for(size_t cnt=0; cnt<200000000; cnt++) d += 1.23456;

        std::lock_guard<std::mutex> lg(cout_mut);
        std::cout << d << "(Help)" << std::endl;
        //        std::cout << "(Help)" << d << std::endl;
    }
}

int main()
{
    std::vector<std::thread> all_t(std::thread::hardware_concurrency());

    auto t_begin = std::chrono::high_resolution_clock::now();

    for(auto& t : all_t) t = std::thread{task};
    for(auto& t : all_t) t.join();

    auto t_end = std::chrono::high_resolution_clock::now();

    std::cout << "Took : " << (t_end - t_begin).count() << std::endl;
}

Under MinGW 4.8.1 it takes roughly 2.5 seconds to execute on my box. That is approximately the time it takes to only execute the task function single-threadedly.

However, when I uncomment the line in the middle and therefore comment out the line before (that is, when I exchange the order in which d and "(Help)" are written to std::cout) the whole thing takes now 8-9 seconds.

What is the explanation?

I tested again and found out that I only have the problem with MinGW-build x32-4.8.1-win32-dwarf-rev3 but not with MinGW build x64-4.8.1-posix-seh-rev3. I have a 64-bit machine. With the 64-bit compiler both versions take three seconds. However, using the 32-bit compiler, the problem remains (and is not due to release/debug version confusion).

like image 605
Toby Brull Avatar asked Aug 04 '13 16:08

Toby Brull


People also ask

Why is multithreading slow?

In fact, multithreading can be slower due to the overhead of creating the threads and context switching between them. The multithreaded program performed worse due to the overhead of creating 100 threads and forcing them all to wait with the mutex .

Does multi threading improve gaming performance?

Since the CPU only does one thing at a time, you don't necessarily gain any speed benefit from multithreading but it can be handy for making your PC responsive. Say your game needs to load up a large resource file into memory.

What happens if there are too many threads?

Thus software threads tend to evict each other's data, and the cache fighting from too many threads can hurt performance. A similar overhead, at a different level, is thrashing virtual memory. Most computers use virtual memory. Virtual memory resides on disk, and the frequently used portions are kept in real memory.


1 Answers

It has nothing to do with multi-threading. It is a problem of loop optimization. I have rearranged the original code to get something minimalistic demonstrating the issue:

#include <iostream>
#include <chrono>
#include <mutex>

int main()
{
    auto t_begin = std::chrono::high_resolution_clock::now();
    for(int i=0; i<2; i++)
    {
        double d=0.0;
        for(int j=0; j<100000; j++) d += 1.23456;
        std::mutex mutex;
        std::lock_guard<std::mutex> lock(mutex);
#ifdef SLOW
        std::cout << 'a' << d << std::endl;
#else
        std::cout << d << 'a' << std::endl;
#endif
    }
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << "Took : " << (static_cast<double>((t_end - t_begin).count())/1000.0) << std::endl;
}

When compiled and executed and with:

g++ -std=c++11 -DSLOW -o slow -O3 b.cpp -lpthread ; g++ -std=c++11 -o fast -O3 b.cpp -lpthread ; ./slow ; ./fast

The output is:

a123456
a123456
Took : 931
123456a
123456a
Took : 373

Most of the difference in timing is explained by the assembly code generated for the inner loop: the fast case accumulates directly in xmm0 while the slow case accumulates into xmm1 - leading to 2 extra movsd instructions.

Now, when compiled with the '-ftree-loop-linear' option:

g++ -std=c++11 -ftree-loop-linear -DSLOW -o slow -O3 b.cpp -lpthread ; g++ -std=c++11 -ftree-loop-linear -o fast -O3 b.cpp -lpthread ; ./slow ; ./fast

The output becomes:

a123456
a123456
Took : 340
123456a
123456a
Took : 346
like image 73
Come Raczy Avatar answered Oct 27 '22 01:10

Come Raczy