I've split a complex array processing task into a number of threads to take advantage of multi-core processing and am seeing great benefits. Currently, at the start of the task I create the threads, and then wait for them to terminate as they complete their work. I'm typically creating about four times as many threads as there are cores, as each thread is liable to take a different amount of time, and having extra threads ensures all cores are kept occupied most of the time. I was wondering whether there would be much of a performance advantage to creating the threads as the program fires up, keeping them idle until required, and using them as I start processing. Put more simply, how long does it take to start and end a new thread, above and beyond the processing within the thread? I'm currently starting the threads using:
CWinThread *pMyThread = AfxBeginThread(CMyThreadFunc,&MyData,THREAD_PRIORITY_NORMAL);
Typically I will be using 32 threads across 8 cores on a 64-bit architecture. The process in question currently takes < 1 second, and is fired up each time the display is refreshed. If starting and ending a thread takes < 1 ms, the return doesn't justify the effort. I'm having some difficulty profiling this.
A related question here helps but is a bit vague for what I'm after. Any feedback appreciated.
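To make the "create the threads once and keep them idle" idea concrete, here is a minimal sketch of what such a pool could look like. It uses standard C++ (std::thread and std::condition_variable) rather than AfxBeginThread, and the names WorkerPool, submit and wait_idle are made up for illustration; treat it as one possible shape, not a tuned implementation.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool: workers are created once and sleep on a
// condition variable until a task is queued.
class WorkerPool {
public:
    explicit WorkerPool(std::size_t worker_count) {
        assert(worker_count > 0);
        for (std::size_t i = 0; i < worker_count; ++i)
            workers_.emplace_back([this] { run(); });
    }

    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_workers_.notify_all();
        for (auto& w : workers_) w.join();
    }

    // Queue one task; an idle worker picks it up.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
            ++pending_;
        }
        wake_workers_.notify_one();
    }

    // Block until every submitted task has finished.
    void wait_idle() {
        std::unique_lock<std::mutex> lock(mutex_);
        all_done_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_workers_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (--pending_ == 0) all_done_.notify_all();
            }
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_workers_;
    std::condition_variable all_done_;
    std::queue<std::function<void()>> tasks_;
    std::size_t pending_ = 0;
    bool stopping_ = false;
    std::vector<std::thread> workers_;  // declared last so the members above exist first
};
```

With something like this, each display refresh would submit its 32 chunks of work and then call wait_idle(), instead of creating and joining 32 threads every time. Whether that is worth the extra code depends on the per-thread cost measured on the target machine.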
The first benchmark simply creates, starts and joins threads. The thread's Runnable does no work. On a typical modern PC running Linux with 64-bit Java 8u101, this benchmark shows an average time to create, start and join a thread of between 33.6 and 33.9 microseconds.
To create a thread, the Windows API supplies the CreateThread() function. Each thread has its own stack. You can specify the size of the new thread's stack in bytes using the stack-size parameter (dwStackSize), which is the second argument of CreateThread(); passing 0 requests the default size.
Compared with a process: because very little memory copying is required (just the thread stack), threads are faster to start than processes. To start a process, the whole process area must be duplicated for the new process copy to start.
When a thread terminates, its termination status changes from STILL_ACTIVE to the exit code of the thread, and the state of the thread object changes to signaled, releasing any other threads that had been waiting for the thread to terminate.
I wrote this quite a while ago when I had the same basic question (along with another that will be obvious). I've updated it to show a little more about not only how long it takes to create threads, but how long it takes for the threads to start executing:
#include <windows.h>
#include <iostream>
#include <time.h>
#include <vector>

const int num_threads = 32;
const int switches_per_thread = 100000;

// Each thread records the moment it starts running, then yields repeatedly.
DWORD __stdcall ThreadProc(void *start) {
    QueryPerformanceCounter((LARGE_INTEGER *) start);
    for (int i=0;i<switches_per_thread; i++)
        Sleep(0);   // give up the rest of the time slice
    return 0;
}

int main(void) {
    HANDLE threads[num_threads];
    DWORD junk;
    std::vector<LARGE_INTEGER> start_times(num_threads);

    // Reference point for the per-thread start times.
    LARGE_INTEGER l;
    QueryPerformanceCounter(&l);

    // Time the creation of all the threads.
    clock_t create_start = clock();
    for (int i=0;i<num_threads; i++)
        threads[i] = CreateThread(NULL,
                                  0,
                                  ThreadProc,
                                  (void *)&start_times[i],
                                  0,
                                  &junk);
    clock_t create_end = clock();

    // Time how long the threads take to get through all their Sleep(0) calls.
    clock_t wait_start = clock();
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    clock_t wait_end = clock();

    // The threads have finished; release their handles.
    for (int i=0;i<num_threads; i++)
        CloseHandle(threads[i]);

    double create_millis = 1000.0 * (create_end - create_start) / CLOCKS_PER_SEC / num_threads;
    std::cout << "Milliseconds to create thread: " << create_millis << "\n";

    double wait_clocks = (wait_end - wait_start);
    double switches = switches_per_thread*num_threads;
    double us_per_switch = wait_clocks/CLOCKS_PER_SEC*1000000/switches;
    std::cout << "Microseconds per thread switch: " << us_per_switch << "\n";

    // Delay between the reference point and each thread actually starting.
    LARGE_INTEGER f;
    QueryPerformanceFrequency(&f);
    for (auto s : start_times)
        std::cout << 1000.0 * (s.QuadPart - l.QuadPart) / f.QuadPart <<" ms\n";

    return 0;
}
Sample results:
Milliseconds to create thread: 0.015625
Microseconds per thread switch: 0.0479687
The first few thread start times look like this:
0.0632517 ms
0.117348 ms
0.143703 ms
0.18282 ms
0.209174 ms
0.232478 ms
0.263826 ms
0.315149 ms
0.324026 ms
0.331516 ms
0.3956 ms
0.408639 ms
0.4214 ms
Note that although these happen to be monotonically increasing, that's not guaranteed (though there is definitely a trend in that general direction).
When I first wrote this, the units I used made more sense -- on a 33 MHz 486, those results weren't tiny fractions like this. :-) I suppose someday when I'm feeling ambitious, I should rewrite this to use std::async to create the threads and std::chrono to do the timing, but...
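For what it's worth, a portable version of the creation measurement could look roughly like the sketch below. It is not the rewrite described above: it uses std::thread instead of std::async (std::async with std::launch::async may reuse pooled threads on some implementations, which would hide the creation cost), and it only measures create-plus-join time, not thread switches or start latency. The function name average_create_join_us is invented for this example.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Average wall-clock microseconds to create, start and join one thread
// that does no work. Hypothetical helper, not part of the original code.
double average_create_join_us(int thread_count) {
    assert(thread_count > 0);
    using steady = std::chrono::steady_clock;

    std::vector<std::thread> threads;
    threads.reserve(static_cast<std::size_t>(thread_count));

    const auto begin = steady::now();
    for (int i = 0; i < thread_count; ++i)
        threads.emplace_back([] {});   // empty body: overhead only
    for (auto& t : threads)
        t.join();
    const auto end = steady::now();

    const std::chrono::duration<double, std::micro> elapsed = end - begin;
    return elapsed.count() / thread_count;
}
```

Calling average_create_join_us(32) and printing the result gives a per-thread figure to compare against the 1 ms threshold from the question. The number will vary between runs and machines, so it is worth repeating the call a few times.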