How long does thread creation and termination take under Windows?

I've split a complex array processing task into a number of threads to take advantage of multi-core processing, and am seeing great benefits. Currently, at the start of the task I create the threads, then wait for them to terminate as they complete their work. I typically create about four times as many threads as there are cores, since each thread is liable to take a different amount of time, and the extra threads keep all cores occupied most of the time.

I was wondering whether there would be much of a performance advantage in creating the threads when the program fires up, keeping them idle until required, and using them as I start processing. Put more simply: how long does it take to start and end a new thread, above and beyond the processing within the thread? I'm currently starting the threads using

CWinThread *pMyThread = AfxBeginThread(CMyThreadFunc,&MyData,THREAD_PRIORITY_NORMAL);

Typically I will be using 32 threads across 8 cores on a 64 bit architecture. The process in question currently takes < 1 second, and is fired up each time the display is refreshed. If starting and ending a thread is < 1ms, the return doesn't justify the effort. I'm having some difficulty profiling this.

A related question here helps but is a bit vague for what I'm after. Any feedback appreciated.
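For reference, this is roughly the micro-benchmark I have in mind (a portable sketch using std::thread as a stand-in for AfxBeginThread, since MFC isn't needed just to get a ballpark for the create/terminate overhead; the function name is my own):

```cpp
#include <chrono>
#include <thread>

// Average the full create + terminate round trip over many iterations,
// since a single create/join is far too short to time reliably.
double avg_create_join_us(int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++) {
        std::thread t([] {});  // empty body: measure only the overhead
        t.join();
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(end - start).count()
           / iterations;
}
```

If the reported figure is well under a millisecond, pre-creating idle threads presumably isn't worth the effort.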

asked Aug 16 '13 by SmacL


2 Answers

I wrote this quite a while ago when I had the same basic question (along with another that will be obvious). I've updated it to show a little more about not only how long it takes to create threads, but how long it takes for the threads to start executing:

#include <windows.h>
#include <iostream>
#include <time.h>
#include <vector>

const int num_threads = 32;
const int switches_per_thread = 100000;

// Each thread records the time it actually starts executing, then forces
// a large number of context switches by repeatedly yielding its timeslice.
DWORD __stdcall ThreadProc(void *start) {
    QueryPerformanceCounter((LARGE_INTEGER *)start);
    for (int i = 0; i < switches_per_thread; i++)
        Sleep(0);
    return 0;
}

int main(void) {
    HANDLE threads[num_threads];
    DWORD junk;

    std::vector<LARGE_INTEGER> start_times(num_threads);

    LARGE_INTEGER l;
    QueryPerformanceCounter(&l);

    clock_t create_start = clock();
    for (int i = 0; i < num_threads; i++)
        threads[i] = CreateThread(NULL,
                                  0,
                                  ThreadProc,
                                  (void *)&start_times[i],
                                  0,
                                  &junk);
    clock_t create_end = clock();

    clock_t wait_start = clock();
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    clock_t wait_end = clock();

    for (int i = 0; i < num_threads; i++)
        CloseHandle(threads[i]);

    double create_millis = 1000.0 * (create_end - create_start) / CLOCKS_PER_SEC / num_threads;
    std::cout << "Milliseconds to create thread: " << create_millis << "\n";

    double wait_clocks = wait_end - wait_start;
    double switches = double(switches_per_thread) * num_threads;
    double us_per_switch = wait_clocks / CLOCKS_PER_SEC * 1000000 / switches;
    std::cout << "Microseconds per thread switch: " << us_per_switch << "\n";

    LARGE_INTEGER f;
    QueryPerformanceFrequency(&f);

    // How long after the baseline reading in main did each thread start running?
    for (auto s : start_times)
        std::cout << 1000.0 * (s.QuadPart - l.QuadPart) / f.QuadPart << " ms\n";

    return 0;
}

Sample results:

Milliseconds to create thread: 0.015625
Microseconds per thread switch: 0.0479687

The first few thread start times look like this:

0.0632517 ms
0.117348 ms
0.143703 ms
0.18282 ms
0.209174 ms
0.232478 ms
0.263826 ms
0.315149 ms
0.324026 ms
0.331516 ms
0.3956 ms
0.408639 ms
0.4214 ms

Note that although these happen to be monotonically increasing, that's not guaranteed (though there is definitely a trend in that general direction).

When I first wrote this, the units I used made more sense -- on a 33 MHz 486, those results weren't tiny fractions like this. :-) I suppose someday when I'm feeling ambitious, I should rewrite this to use std::async to create the threads and std::chrono to do the timing, but...
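A rough sketch of what that rewrite might look like (portable, using std::async and std::chrono; the thread-switch measurement is omitted, the function and struct names are invented for the sketch, and absolute numbers will of course differ from the Windows version):

```cpp
#include <chrono>
#include <future>
#include <vector>

using clk = std::chrono::steady_clock;

struct StartupTimes {
    double avg_create_ms;                  // mean cost of launching one task
    std::vector<double> start_delays_ms;   // per-task delay until it runs
};

// Launch n async tasks, each of which just records when it starts
// executing, measured against a baseline taken before any launches.
StartupTimes measure_thread_startup(int n) {
    auto base = clk::now();
    std::vector<std::future<clk::time_point>> futures;

    auto create_start = clk::now();
    for (int i = 0; i < n; i++)
        futures.push_back(std::async(std::launch::async,
                                     [] { return clk::now(); }));
    auto create_end = clk::now();

    StartupTimes r;
    r.avg_create_ms =
        std::chrono::duration<double, std::milli>(create_end - create_start).count() / n;
    for (auto &f : futures)
        r.start_delays_ms.push_back(
            std::chrono::duration<double, std::milli>(f.get() - base).count());
    return r;
}
```

Note that std::async is permitted to draw from an implementation-managed pool, so this may understate raw CreateThread cost on some standard libraries.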

answered Nov 02 '22 by Jerry Coffin

Some advice:

  1. If you have lots of work items to process (or not that many, but you have to repeat the whole process from time to time), make sure you use some kind of thread pooling. That way you won't have to recreate the threads each time, and your original question no longer matters: the threads are created only once. I use the QueueUserWorkItem API directly (since my application doesn't use MFC), and even that is not too painful, though MFC may offer higher-level facilities that take advantage of thread pooling. (http://support.microsoft.com/kb/197728)
  2. Try to select the optimal amount of work for one work item. This depends on what your software does: is it supposed to be real-time, or is it number crunching in the background? If it's not real-time, too little work per item can hurt performance by increasing the proportion of time spent distributing work across threads rather than doing it.
  3. Since hardware configurations vary widely, if your end users can have different machines, you can run a calibration routine asynchronously at startup to estimate how long certain operations take. The calibration results can then feed into a better work-item size for the real calculations.
answered Nov 02 '22 by Csaba Toth