There are a lot of ways to measure CPU context switching overhead, but there seem to be few resources on measuring GPU context switching overhead. CPU context switching and GPU context switching are quite different.
GPU scheduling is based on warp scheduling. To calculate the overhead of GPU context switching, I need the execution time of a warp with context switching and of a warp without context switching, and then subtract the two to get the overhead.
I am confused about how to measure the time of a warp with context switching. Does anyone have ideas on how to measure it?
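The kind of approach I have been considering is sketched below (the kernel name, launch shapes, and iteration count are just placeholders): timestamp each warp from inside the kernel with clock64(), run once with a single resident warp and once with many resident warps, and subtract the per-warp cycle counts. I am not sure this actually isolates the switching cost, which is part of what I am asking.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each warp records how many SM cycles elapsed between its first and last
// instruction, using the per-SM clock64() counter.
__global__ void timed_warp(long long *cycles, float *data, int iters)
{
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;

    long long start = clock64();          // cycle counter when the warp begins
    float x = data[gtid];
    for (int i = 0; i < iters; ++i)       // arithmetic work to keep the warp busy
        x = x * 1.000001f + 0.5f;
    data[gtid] = x;
    long long stop = clock64();           // cycle counter when the warp finishes

    if ((threadIdx.x & 31) == 0)          // lane 0 of each warp stores the result
        cycles[gtid / 32] = stop - start;
}

// Launch with a given shape and return the elapsed cycles seen by warp 0.
static long long run(int blocks, int threads, int iters)
{
    float *data;
    long long *cycles;
    cudaMalloc((void **)&data, blocks * threads * sizeof(float));
    cudaMalloc((void **)&cycles, blocks * threads / 32 * sizeof(long long));
    cudaMemset(data, 0, blocks * threads * sizeof(float));

    timed_warp<<<blocks, threads>>>(cycles, data, iters);
    cudaDeviceSynchronize();

    long long warp0;
    cudaMemcpy(&warp0, cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    cudaFree(cycles);
    cudaFree(data);
    return warp0;
}

int main()
{
    const int iters = 100000;
    long long alone   = run(1, 32, iters);    // single resident warp: (almost) no switching
    long long crowded = run(64, 256, iters);  // many resident warps sharing each SM
    printf("warp 0 alone:   %lld cycles\n", alone);
    printf("warp 0 crowded: %lld cycles\n", crowded);
    printf("difference:     %lld cycles\n", crowded - alone);
    return 0;
}
```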
Calculating context switch time: one suitable method is to record, for each process, the timestamp of its first instruction, the timestamp of its last instruction, and its waiting time in the queue. If the total time over which all of the processes ran was T, then: context switch time = T − Σ over all processes (waiting time + execution time).
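As a sketch of that bookkeeping (the struct layout, field names, and sample numbers below are illustrative, not a prescribed API):

```cuda
#include <cstdio>
#include <vector>

// Per-process record, following the bookkeeping described above.
struct ProcRecord {
    double start;   // timestamp of the process's first instruction
    double end;     // timestamp of the process's last instruction
    double wait;    // time the process spent waiting in the queue
};

// T is the total time over which all of the processes ran.
double context_switch_time(const std::vector<ProcRecord> &procs, double T)
{
    double accounted = 0.0;
    for (const ProcRecord &p : procs) {
        double exec = p.end - p.start;     // execution time, per the formula above
        accounted += p.wait + exec;        // waiting time + execution time
    }
    return T - accounted;                  // whatever is unaccounted for is switch time
}

int main()
{
    // Made-up numbers, purely to show the arithmetic.
    std::vector<ProcRecord> procs = {
        {0.0, 5.0, 1.0},
        {6.0, 12.0, 2.0},
    };
    printf("context switch time: %.2f\n", context_switch_time(procs, 15.0));
    return 0;
}
```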
To measure how long it takes to switch between two threads, we need a benchmark that deliberately triggers a context switch while doing as little other work as possible. This measures only the direct cost of the switch; in reality there is also an indirect cost, which can be even larger.
Context switching incurs overhead because of TLB flushes, sharing the cache between multiple tasks, running the task scheduler, etc. Context switching between two threads of the same process is faster than between two different processes, since the threads share the same virtual memory map.
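A minimal sketch of such a benchmark, assuming two threads that hand a turn flag back and forth through a condition variable (the names and iteration count are illustrative); pin both threads to a single core, for example with taskset -c 0, so that each handoff really forces a context switch rather than a cross-core wakeup:

```cuda
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

int main()
{
    constexpr int kIters = 100000;      // number of handoffs per thread
    std::mutex m;
    std::condition_variable cv;
    bool ping_turn = true;              // whose turn it is

    auto worker = [&](bool i_am_ping) {
        for (int i = 0; i < kIters; ++i) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return ping_turn == i_am_ping; });
            ping_turn = !i_am_ping;     // hand the turn to the other thread...
            cv.notify_one();            // ...and wake it, forcing a switch
        }
    };

    auto t0 = std::chrono::steady_clock::now();
    std::thread a(worker, true), b(worker, false);
    a.join();
    b.join();
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    // Each of the 2 * kIters handoffs is roughly one switch, plus the (small)
    // cost of the lock and condition variable themselves, so this is an
    // upper bound on the direct cost.
    printf("~%.0f ns per handoff\n", ns / (2.0 * kIters));
    return 0;
}
```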
I don't think it really makes sense to talk about "overhead" of context switching on a GPU.
On a CPU, context switching is done in software, by a function in the kernel called a "scheduler". The scheduler is ordinary code, a sequence of machine instructions that the processor has to run, and time spent running the scheduler is time not spent doing "useful" work.
A GPU, on the other hand, does context switching in hardware, without a scheduler, and it's fast enough that when one task encounters a pipeline stall, another task can be brought in to utilize the pipeline stages that would otherwise be idle. This is called "latency hiding" — delays in one task are hidden by progress in other tasks. The context switches actually allow more useful work to be done in a given timeframe.
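A small experiment makes this visible (a sketch, assuming CUDA; the kernel name, table size, and launch shapes are illustrative): run the same memory-latency-bound pointer chase once with a single warp and once with 32 warps in one block. The total work grows 32-fold, but the elapsed time grows far less, because stalled warps are simply switched out for warps that are ready to run:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// A memory-latency-bound pointer chase: every load stalls the warp until
// the data arrives, giving the warp scheduler something to hide.
__global__ void chase(const int *next, int steps, int *sink)
{
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * 64;
    for (int i = 0; i < steps; ++i)
        idx = next[idx];
    if (idx == -1)       // never true; keeps the loop from being optimized away
        *sink = idx;
}

static float time_launch(int blocks, int threads, const int *next, int steps, int *sink)
{
    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);
    cudaEventRecord(beg);
    chase<<<blocks, threads>>>(next, steps, sink);
    cudaEventRecord(end);
    cudaEventSynchronize(end);
    float ms;
    cudaEventElapsedTime(&ms, beg, end);
    cudaEventDestroy(beg);
    cudaEventDestroy(end);
    return ms;
}

int main()
{
    const int N = 1 << 22, steps = 100000;          // 16 MB table, too big for cache
    std::vector<int> h_next(N);
    for (int i = 0; i < N; ++i)
        h_next[i] = (int)(((long long)i * 37 + 11) % N);   // scrambled chain

    int *d_next, *d_sink;
    cudaMalloc((void **)&d_next, N * sizeof(int));
    cudaMalloc((void **)&d_sink, sizeof(int));
    cudaMemcpy(d_next, h_next.data(), N * sizeof(int), cudaMemcpyHostToDevice);

    printf("1 warp   : %.2f ms\n", time_launch(1, 32, d_next, steps, d_sink));
    printf("32 warps : %.2f ms\n", time_launch(1, 1024, d_next, steps, d_sink));

    cudaFree(d_next);
    cudaFree(d_sink);
    return 0;
}
```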
For more information, see this answer I wrote to a related question on SuperUser.