Cost of context switch between threads of same process, on Linux

Tags:

Is there any good empirical data on the cost of context switching between threads of the same process on Linux (x86 and x86_64, mainly, are of interest)? I'm talking about the number of cycles or nanoseconds between the last instruction one thread executes in userspace before getting put to sleep voluntarily or involuntarily, and the first instruction a different thread of the same process executes after waking up on the same cpu/core.

I wrote a quick test program that constantly performs rdtsc in 2 threads assigned to the same cpu/core, stores the result in a volatile variable, and compares to its sister thread's corresponding volatile variable. The first time it detects a change in the sister thread's value, it prints the difference, then goes back to looping. I'm getting minimum/median counts of about 8900/9600 cycles this way on an Atom D510 cpu. Does this procedure seem reasonable, and do the numbers seem believable?

My goal is to estimate whether, on modern systems, thread-per-connection server model could be competitive with or even outperform select-type multiplexing. This seems plausible in theory, as the transition from performing IO on fd X to fd Y involves merely going to sleep in one thread and waking up in another, rather than multiple syscalls, but it's dependent on the overhead of context switching.

651

asked May 11 '11 03:05

R.. GitHub STOP HELPING ICE

1 Answers

(Disclaimer: This isn't a direct answer to the question, it's just some suggestions that I hope will be helpful).

Firstly, the numbers you're getting certainly sound like they're within the ballpark. Note, however, that the interrupt / trap latency can vary a lot among different CPU models implementing the same ISA. It's also a different story if your threads have used floating point or vector operations, because if they haven't the kernel avoids saving/restoring the floating point or vector unit state.

You should be able to get more accurate numbers by using the kernel tracing infrastructure - perf sched in particular is designed to measure and analyse scheduler latency.

If your goal is to model thread-per-connection servers, then you probably shouldn't be measuring involuntary context switch latency - usually in such a server, the majority of context switches will be voluntary, as a thread blocks in read() waiting for more data from the network. Therefore, a better testbed might involve measuring the latency from one thread blocking in a read() to another being woken up from the same.

Note that in a well-written multiplexing server under heavy load, the transition from fd X to fd Y will often involve the same single system call (as the server iterates over a list of active file descriptors returned from a single epoll()). One thread also ought to have less cache footprint than multiple threads, simply through having only one stack. I suspect the only way to settle the matter (for some definition of "settle") might be to have a benchmark shootout...

131

answered Oct 19 '22 01:10

caf

Related questions
                            
                                How can I generate C wrappers over C++ API using SWIG? [duplicate]
                            
                                Why does Visual C++ warn on implicit cast from const void ** to void * in C, but not in C++?
                            
                                C vs assembler vs NEON performance
                            
                                How to emulate strongly typed enum in C?
                            
                                How to compare socket address in C?
                            
                                Valgrind stack misses a function completely
                            
                                Is there a way of compiling C11 to C89?
                            
                                Why putchar, toupper, tolower, etc. take a int instead of a char?
                            
                                How do I properly use Python's C API and exceptions?
                            
                                Optimizing linear access to arrays with pre-fetching and cache in C
                            
                                Example code to trigger Clang's static analyser
                            
                                Is scanf's "regex" support a standard?
                            
                                how to postpone a function call in c
                            
                                %n format specifier program giving different outputs on different compilers. Why?
                            
                                Array bounds checks on 64-bit hardware using hardware memory-protection
                            
                                Which file systems support splicing via Linux's splice(2)?
                            
                                What float value makes sprintf_s() produce "1.#QO"?
                            
                                Why would the size of a packed structure be different on Linux and Windows when using gcc?
                            
                                Cache-friendly copying of an array with readjustment by known index, gather, scatter
                            
                                Equivalent of CGPoint with integers?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Cost of context switch between threads of same process, on Linux

Tags:

c

linux

posix

pthreads

context-switch

R.. GitHub STOP HELPING ICE

People also ask

1 Answers

caf

Recent Activity

Donate For Us