I have upgraded a number crunching application to a multi-threaded program, using the C++11 facilities. It works well on Mac OS X but does not benefit from multithreading on Windows (Visual Studio 2013). Using the following toy program
#include <iostream>
#include <thread>
void t1(int& k) {
k += 1;
};
void t2(int& k) {
k += 1;
};
int main(int argc, const char *argv[])
{
int a{ 0 };
int b{ 0 };
auto start_time = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 10000; ++i) {
std::thread thread1{ t1, std::ref(a) };
std::thread thread2{ t2, std::ref(b) };
thread1.join();
thread2.join();
}
auto end_time = std::chrono::high_resolution_clock::now();
auto time_stack = std::chrono::duration_cast<std::chrono::microseconds>(
end_time - start_time).count();
std::cout << "Time: " << time_stack / 10000.0 << " micro seconds" <<
std::endl;
std::cout << a << " " << b << std::endl;
return 0;
}
I have discovered that it takes 34 microseconds to start a thread on Mac OS X and 340 microseconds to do the same on Windows. Am I doing something wrong on the Windows side ? Is it a compiler issue ?
Not a compiler problem (nor an operating system problem, strictly speaking).
It is a well-known fact that creating threads is an expensive operation. This is especially true under Windows (used to be true under Linux prior to clone
as well).
Also, creating and joining a thread is necessarily slow and does not tell a lot about creating a thread as such. Joining presumes that the thread has exited, which can only happen after it has been scheduled to run. Thus, your measurements include delays introduced by scheduling. Insofar, the times you measure are actually pretty good (they could easily be 20 times longer!).
However, it does not matter a lot whether spawning threads is slow anyway.
Creating 20,000 threads like in your benchmark in a real program is a serious error. While it is not strictly illegal or disallowed to create thousands (even millions) of threads, the "correct" way of using threads is to create no more threads than there are approximately CPU cores. One does not create very short-lived threads all the time either.
You might have a few short-lived ones, and you might create a few extra threads (which e.g. block on I/O), but you will not want to create hundreds or thousands of these. Every additional thread (beyond the number of CPU cores) means more context switches, more scheduler work, more cache pressure, and 1MB of address space and 64kB of physical memory gone per thread (due to stack reserve and commit granularity).
Now, assume you create for example 10 threads at program start, it does not matter at all whether this takes 3 milliseconds alltogether. It takes several hundred milliseconds (at least) for the program to start up anyway, nobody will notice a difference.
Visual C++ uses Concurrency Runtime (MS specific) to implement std.thread
features. When you directly call any Concurrency Runtime feature/function, it creates a default runtime object (not going into details). Or, when you call std.thread
function, it does the same as of ConcRT function was invoked.
The creation of default runtime (or say, scheduler) takes sometime, and hence it appear to be taking sometime. Try creating a std::thread
object, let it run; and then execute the benching marking code (whole of above code, for example).
EDIT:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With