When compiling the following code:
#include <iostream>
#include <vector>
#include <thread>
#include <chrono>
#include <mutex>

std::mutex cout_mut;

void task()
{
    for (int i = 0; i < 10; i++)
    {
        double d = 0.0;
        for (size_t cnt = 0; cnt < 200000000; cnt++) d += 1.23456;
        std::lock_guard<std::mutex> lg(cout_mut);
        std::cout << d << "(Help)" << std::endl;
        // std::cout << "(Help)" << d << std::endl;
    }
}

int main()
{
    std::vector<std::thread> all_t(std::thread::hardware_concurrency());
    auto t_begin = std::chrono::high_resolution_clock::now();
    for (auto& t : all_t) t = std::thread{task};
    for (auto& t : all_t) t.join();
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << "Took : " << (t_end - t_begin).count() << std::endl;
}
Under MinGW 4.8.1 it takes roughly 2.5 seconds to execute on my box, which is approximately the time it takes to run the task function in a single thread.
However, when I uncomment the middle line and comment out the line before it (that is, when I swap the order in which d and "(Help)" are written to std::cout), the whole thing now takes 8-9 seconds.
What is the explanation?
I tested again and found that the problem only occurs with MinGW-build x32-4.8.1-win32-dwarf-rev3, but not with MinGW-build x64-4.8.1-posix-seh-rev3. I have a 64-bit machine. With the 64-bit compiler both versions take about three seconds. With the 32-bit compiler, however, the problem remains (and it is not due to a release/debug configuration mix-up).
In fact, multithreading can be slower than a single-threaded version because of the overhead of creating the threads and of context switching between them. A multithreaded program can also perform worse when the threads spend their time waiting on a shared mutex instead of running in parallel.
Since each core only does one thing at a time, you don't necessarily gain any speed from multithreading, but it can be handy for keeping your program responsive. Say your game needs to load a large resource file into memory: a background thread can do that while the main thread keeps handling input.
Too many software threads also tend to evict each other's data from the CPU caches, and that cache fighting can hurt performance. A similar overhead, at a different level, is thrashing virtual memory: most computers use virtual memory, which resides on disk, with only the frequently used portions kept in real memory.
It has nothing to do with multi-threading. It is a problem of loop optimization. I have rearranged the original code to get something minimalistic demonstrating the issue:
#include <iostream>
#include <chrono>
#include <mutex>

int main()
{
    auto t_begin = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 2; i++)
    {
        double d = 0.0;
        for (int j = 0; j < 100000; j++) d += 1.23456;
        std::mutex mutex;
        std::lock_guard<std::mutex> lock(mutex);
#ifdef SLOW
        std::cout << 'a' << d << std::endl;
#else
        std::cout << d << 'a' << std::endl;
#endif
    }
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << "Took : " << (static_cast<double>((t_end - t_begin).count())/1000.0) << std::endl;
}
When compiled and executed with:
g++ -std=c++11 -DSLOW -o slow -O3 b.cpp -lpthread ; g++ -std=c++11 -o fast -O3 b.cpp -lpthread ; ./slow ; ./fast
The output is:
a123456
a123456
Took : 931
123456a
123456a
Took : 373
Most of the difference in timing is explained by the assembly code generated for the inner loop: in the fast case the accumulation happens directly in xmm0, while in the slow case it goes through xmm1, leading to two extra movsd instructions per iteration.
Now, when compiled with the '-ftree-loop-linear' option:
g++ -std=c++11 -ftree-loop-linear -DSLOW -o slow -O3 b.cpp -lpthread ; g++ -std=c++11 -ftree-loop-linear -o fast -O3 b.cpp -lpthread ; ./slow ; ./fast
The output becomes:
a123456
a123456
Took : 340
123456a
123456a
Took : 346