How to estimate the thread context switching overhead?

I am trying to improve the performance of a threaded application with real-time deadlines. It is running on Windows Mobile and written in C / C++. I have a suspicion that the high frequency of thread switching might be causing tangible overhead, but I can neither prove nor disprove it. As everybody knows, absence of proof is not proof of the opposite :).

Thus my question is threefold:

  • If they exist at all, where can I find actual measurements of the cost of switching thread context?

  • Without spending time writing a test application, what are the ways to estimate the thread switching overhead in the existing application?

  • Does anyone know a way to find out the number of context switches (onto / off the CPU) for a given thread?

asked Nov 20 '08 by Ignas Limanauskas



2 Answers

I doubt you can find this overhead somewhere on the web for any existing platform. There are just too many different platforms. The overhead depends on two factors:

  • The CPU, as the necessary operations may be easier or harder on different CPU types
  • The system kernel, as different kernels will have to perform different operations on each switch

A further factor is how the switch takes place. A switch can take place when:

  1. the thread has used up its time quantum. When a thread is started, it may run for a given amount of time before it has to return control to the kernel, which will decide who's next.

  2. the thread was preempted. This happens when another thread needs CPU time and has a higher priority. E.g. the thread that handles mouse/keyboard input may be such a thread. No matter which thread owns the CPU right now, when the user types something or clicks something, he doesn't want to wait until the current thread's time quantum has been used up completely; he wants to see the system react immediately. Thus some systems will stop the current thread immediately and return control to some other thread with higher priority.

  3. the thread doesn't need CPU time anymore, because it's blocking on some operation or has just called sleep() (or similar) to stop running.

These three scenarios might have different thread-switching times in theory. E.g. I'd expect the last one to be slowest, since a call to sleep() means the CPU is given back to the kernel and the kernel needs to set up a wake-up call that will make sure the thread is woken up after roughly the amount of time it requested to sleep. It must then take the thread out of the scheduling process, and once the thread is woken up, it must add the thread back to the scheduling process. All these steps will take some amount of time, so the actual sleep call might take longer than the time it takes to switch to another thread.
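
You can see this effect with a quick experiment. The following sketch (assuming a POSIX system with clock_gettime and CLOCK_MONOTONIC; the numbers are rough) compares a bare sched_yield(), which gives up the CPU without any timer involvement, against the shortest possible nanosleep(), which goes through the full block/wake machinery. Note that the sleep figure also includes timer latency on top of scheduler work, which is precisely the point:

#include <sched.h>
#include <stdio.h>
#include <time.h>

// Current monotonic time in nanoseconds.
static long long nowNS (void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (long long)t.tv_sec * 1000000000LL + t.tv_nsec;
}

int main (void) {
    enum { ITERATIONS = 10000 };
    struct timespec tiny = { 0, 1 }; // 1 ns: shortest possible sleep request
    long long start, yieldNS, sleepNS;
    int i;

    // Cost of giving up the CPU without any timer involvement.
    start = nowNS();
    for (i = 0; i < ITERATIONS; i++) {
        sched_yield();
    }
    yieldNS = (nowNS() - start) / ITERATIONS;

    // Cost of a full block/wake round trip. This includes timer
    // latency, so expect it to be far larger than a bare yield.
    start = nowNS();
    for (i = 0; i < ITERATIONS; i++) {
        nanosleep(&tiny, NULL);
    }
    sleepNS = (nowNS() - start) / ITERATIONS;

    printf("sched_yield: %lld ns/call, nanosleep: %lld ns/call\n",
           yieldNS, sleepNS);
    return 0;
}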

I think if you want to know for sure, you must benchmark. The problem is that you usually will have to either put threads to sleep or synchronize them using mutexes, and sleeping or locking/unlocking mutexes has an overhead of its own. This means your benchmark will include these overheads as well. Without a powerful profiler, it's hard to say afterwards how much CPU time was used for the actual switch and how much for the sleep/mutex call.

On the other hand, in a real-life scenario, your threads will either sleep or synchronize via locks as well. A benchmark that purely measures the context-switch time is a synthetic benchmark, as it does not model any real-life scenario. Benchmarks are much more "realistic" if they are based on real-life scenarios. Of what use is a GPU benchmark that tells me my GPU can in theory handle 2 billion polygons a second, if this result can never be achieved in a real-life 3D application? Wouldn't it be much more interesting to know how many polygons a real-life 3D application can have the GPU handle per second?
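
One thing you can do to roughly separate the two is to time the synchronization primitives alone. Here is a minimal sketch (assuming a POSIX system; the iteration count is arbitrary) that locks and unlocks an uncontended mutex in a single thread. No context switch can happen here, so the result is pure mutex overhead that you can subtract from the per-iteration time of a real switch benchmark like the one further down:

#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

// Same millisecond clock as in the benchmark further down.
static long long timeMS (void) {
    struct timeval t;
    gettimeofday(&t, NULL);
    return (long long)t.tv_sec * 1000 + t.tv_usec / 1000;
}

int main (void) {
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    const long ITERATIONS = 10000000L;
    long long start = timeMS();
    long i;

    // Single-threaded, uncontended lock/unlock: no switch can happen,
    // so this measures pure mutex overhead.
    for (i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
    }

    printf("%lld ms for %ld lock/unlock pairs\n",
           timeMS() - start, ITERATIONS);
    return 0;
}

On most implementations an uncontended lock/unlock pair stays entirely in user space, so expect it to be far cheaper than a full switch.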

Unfortunately I know nothing of Windows programming. I could write an application for Windows in Java or maybe in C#, but C/C++ on Windows makes me cry. I can only offer you some source code for POSIX.

#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <pthread.h>
#include <sys/time.h>
#include <unistd.h>

uint32_t COUNTER;
pthread_mutex_t LOCK;
pthread_mutex_t START;
pthread_cond_t CONDITION;

void * threads (void * unused) {
    // Wait till we may fire away
    pthread_mutex_lock(&START);
    pthread_mutex_unlock(&START);

    pthread_mutex_lock(&LOCK);
    // If I'm not the first thread, the other thread is already waiting on
    // the condition, thus I have to wake it up first, otherwise we'll deadlock
    if (COUNTER > 0) {
        pthread_cond_signal(&CONDITION);
    }
    for (;;) {
        COUNTER++;
        pthread_cond_wait(&CONDITION, &LOCK);
        // Always wake up the other thread before processing. The other
        // thread will not be able to do anything as long as I don't go
        // back to sleep first.
        pthread_cond_signal(&CONDITION);
    }
    pthread_mutex_unlock(&LOCK); // Never reached; here for completeness
}

int64_t timeInMS () {
    struct timeval t;

    gettimeofday(&t, NULL);
    return (
        (int64_t)t.tv_sec * 1000 +
        (int64_t)t.tv_usec / 1000
    );
}

int main (int argc, char ** argv) {
    pthread_t t1;
    pthread_t t2;
    int64_t myTime;

    pthread_mutex_init(&LOCK, NULL);
    pthread_mutex_init(&START, NULL);
    pthread_cond_init(&CONDITION, NULL);

    pthread_mutex_lock(&START);
    COUNTER = 0;
    pthread_create(&t1, NULL, threads, NULL);
    pthread_create(&t2, NULL, threads, NULL);
    pthread_detach(t1);
    pthread_detach(t2);
    // Get start time and fire away
    myTime = timeInMS();
    pthread_mutex_unlock(&START);
    // Wait for about a second
    sleep(1);
    // Stop both threads
    pthread_mutex_lock(&LOCK);
    // Find out how much time has really passed. sleep won't guarantee me that
    // I sleep exactly one second, I might sleep longer since even after being
    // woken up, it can take some time before I gain back CPU time. Further
    // some more time might have passed before I obtained the lock!
    myTime = timeInMS() - myTime;
    // Correct the number of thread switches accordingly
    COUNTER = (uint32_t)(((uint64_t)COUNTER * 1000) / myTime);
    printf("Number of thread switches in about one second was %u\n", COUNTER);
    return 0;
}

Output

Number of thread switches in about one second was 108406 

Over 100,000 is not too bad, and that even though we have locking and conditional waits. That works out to roughly 9 µs per switch, including the synchronization overhead of each iteration. I'd guess that without all this stuff, at least twice as many thread switches would be possible per second.
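
As for your third bullet (counting the switches a given thread actually performs): I know of no Windows Mobile API for it, but on POSIX systems the kernel keeps per-process counters that getrusage() exposes. A small sketch (the ru_nvcsw / ru_nivcsw fields are maintained on Linux and the BSDs; the per-thread RUSAGE_THREAD variant is a Linux extension):

#include <stdio.h>
#include <sys/resource.h>

int main (void) {
    struct rusage usage;

    // ... run the workload you want to inspect here ...

    // RUSAGE_SELF sums over all threads of the process; on Linux,
    // RUSAGE_THREAD restricts the numbers to the calling thread.
    if (getrusage(RUSAGE_SELF, &usage) == 0) {
        // Voluntary switches: the thread gave up the CPU (blocked, slept).
        // Involuntary switches: the scheduler preempted it.
        printf("voluntary: %ld, involuntary: %ld\n",
               usage.ru_nvcsw, usage.ru_nivcsw);
    }
    return 0;
}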

answered Sep 29 '22 by Mecki


You can't estimate it. You need to measure it. And it's going to vary depending on the processor in the device.

There are two fairly simple ways to measure a context switch. One involves code, the other doesn't.

First, the code way (pseudocode):

DWORD tick;

main() {
  HANDLE hThread = CreateThread(..., ThreadProc, CREATE_SUSPENDED, ...);

  CeSetThreadPriority(hThread, 10); // real high, so the switch happens promptly
  tick = QueryPerformanceCounter(); // take the timestamp last, right before the switch
  ResumeThread(hThread);
  Sleep(10); // give the worker a chance to run
}

ThreadProc() {
  tick = QueryPerformanceCounter() - tick;
  RETAILMSG(TRUE, (_T("ET: %i\r\n"), tick));
}

Obviously doing it in a loop and averaging will be better. Keep in mind that this doesn't just measure the context switch. You're also measuring the call to ResumeThread and there's no guarantee the scheduler is going to immediately switch to your other thread (though the priority of 10 should help increase the odds that it will).
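
For the loop-and-average variant, one possible shape (a sketch only, untested on a device; it assumes nothing beyond CreateEvent, SetEvent, WaitForSingleObject and QueryPerformanceCounter / QueryPerformanceFrequency, all of which Windows CE provides) is to ping-pong between two threads with a pair of auto-reset events and divide the elapsed time by the number of round trips:

#include <windows.h>

#define ROUND_TRIPS 10000

static HANDLE g_ping, g_pong;

DWORD WINAPI PongThread(LPVOID unused)
{
    int i;
    for (i = 0; i < ROUND_TRIPS; i++) {
        WaitForSingleObject(g_ping, INFINITE); // wait for the main thread
        SetEvent(g_pong);                      // wake it again
    }
    return 0;
}

int main(void)
{
    LARGE_INTEGER freq, start, stop;
    HANDLE hThread;
    int i;

    // Auto-reset events, initially non-signaled.
    g_ping = CreateEvent(NULL, FALSE, FALSE, NULL);
    g_pong = CreateEvent(NULL, FALSE, FALSE, NULL);
    hThread = CreateThread(NULL, 0, PongThread, NULL, 0, NULL);

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    for (i = 0; i < ROUND_TRIPS; i++) {
        SetEvent(g_ping);                      // wake the other thread
        WaitForSingleObject(g_pong, INFINITE); // sleep until it answers
    }
    QueryPerformanceCounter(&stop);

    // Each round trip contains two context switches.
    RETAILMSG(TRUE, (_T("~%d us per switch\r\n"),
        (int)((stop.QuadPart - start.QuadPart) * 1000000
              / freq.QuadPart / (2 * ROUND_TRIPS))));

    WaitForSingleObject(hThread, INFINITE);
    return 0;
}

Each SetEvent / WaitForSingleObject pair forces a switch to the other thread, so the same scheduler caveats apply, but the averaging smooths out individual outliers.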

You can get a more accurate measurement with CeLog by hooking into scheduler events, but it's far from simple to do and not very well documented. If you really want to go that route, Sue Loh has several blogs on it that a search engine can find.

The non-code route would be to use Remote Kernel Tracker. Install eVC 4.0 or the eval version of Platform Builder to get it. It will give a graphical display of everything the kernel is doing and you can directly measure a thread context switch with the provided cursor capabilities. Again, I'm certain Sue has a blog entry on using Kernel Tracker as well.

All that said, you're going to find that CE intra-process thread context switches are really, really fast. It's the process switches that are expensive, as a process switch requires swapping the newly active process's address space into place and then doing the migration.

answered Sep 29 '22 by ctacke