
How can you measure the time spent in a context switch on the Java platform?


Let's assume each thread is doing some floating-point (FP) calculation. I am interested in:

  • how much CPU time is spent switching between threads instead of running them
  • how much synchronization traffic is created on the shared memory bus (when threads share data, they must use a synchronization mechanism)

My question: how do I design a test program to get this data?

SecureFish asked Aug 08 '11 19:08



1 Answer

You can't easily separate the waste due to thread switching from the waste due to memory cache contention. You CAN measure thread contention: on Linux, you can cat /proc/PID/XXX and get tons of detailed per-thread statistics. HOWEVER, since the pre-emptive scheduler is not going to shoot itself in the foot, you're not going to get more than, say, 30 context switches per second no matter how many threads you use, and that time is going to be relatively small vs. the amount of work you're doing. The real cost of context switching is cache pollution: there is a high probability that you'll have mostly cache misses once you're context-switched back in. Thus OS time and context-switch counts are of minimal value.
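As a rough illustration of reading those per-thread statistics from inside a Java program, here is a minimal sketch. It assumes a Linux kernel that exposes `voluntary_ctxt_switches` and `nonvoluntary_ctxt_switches` in `/proc/self/task/<tid>/status`; the class and method names (`CtxSwitchCounter`, `parseCounter`, `totalSwitches`) are made up for this example and are not from the answer above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

class CtxSwitchCounter {
    /** Returns the value of a "name:<tab>value" line, or -1 if the line is absent. */
    static long parseCounter(List<String> statusLines, String name) {
        for (String line : statusLines) {
            if (line.startsWith(name + ":")) {
                return Long.parseLong(line.substring(name.length() + 1).trim());
            }
        }
        return -1;
    }

    /** Sums both counters over all live threads of this JVM; -1 if /proc is unavailable. */
    static long totalSwitches() throws IOException {
        Path tasks = Paths.get("/proc/self/task"); // assumption: Linux procfs layout
        if (!Files.isDirectory(tasks)) {
            return -1;
        }
        long total = 0;
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(tasks)) {
            for (Path task : dirs) {
                try {
                    List<String> lines =
                        Files.readAllLines(task.resolve("status"), StandardCharsets.ISO_8859_1);
                    total += Math.max(0, parseCounter(lines, "voluntary_ctxt_switches"));
                    total += Math.max(0, parseCounter(lines, "nonvoluntary_ctxt_switches"));
                } catch (IOException threadExited) {
                    // A thread may end between listing the directory and reading its file.
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        long before = totalSwitches();
        double x = 0;
        for (int i = 1; i < 50_000_000; i++) {
            x += Math.sqrt(i); // stand-in FP workload
        }
        long after = totalSwitches();
        System.out.println("result=" + x + " context switches during run=" + (after - before));
    }
}
```

Sampling the total before and after a workload gives a switch count for that interval. Threads that exit before the second sample take their counts with them, so this is only an approximation, and it says nothing about the cache effects described above.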

What's REALLY valuable is the rate of inter-thread cache-line dirties, i.e. one thread writing a cache line that another thread then reads. Depending on the CPU, a cache-line dirty followed by a peer-CPU read is SLOWER than a cache miss, because you have to force the peer CPU to write its value to main memory before you can even start reading. Some CPUs let you pull from peer cache lines without hitting main memory.
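One way to see this effect in a test program is to have two threads update counters that either sit next to each other or are spaced far apart. The sketch below is an illustration only: it assumes 64-byte cache lines (so a gap of 16 longs puts the counters on different lines), and the names `SharedLineDemo` and `run` are invented for this example.

```java
import java.util.concurrent.atomic.AtomicLongArray;

class SharedLineDemo {
    /** Two threads each increment their own slot; returns elapsed nanoseconds. */
    static long run(int indexA, int indexB, long iterations) {
        AtomicLongArray counters = new AtomicLongArray(32);
        Thread a = new Thread(() -> {
            for (long i = 0; i < iterations; i++) {
                counters.incrementAndGet(indexA);
            }
        });
        Thread b = new Thread(() -> {
            for (long i = 0; i < iterations; i++) {
                counters.incrementAndGet(indexB);
            }
        });
        long start = System.nanoTime();
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        long elapsed = System.nanoTime() - start;
        // Sanity check: no increments were lost.
        if (counters.get(indexA) != iterations || counters.get(indexB) != iterations) {
            throw new IllegalStateException("lost updates");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        long n = 20_000_000L;
        run(0, 1, n);  // warm-up so the JIT compiles the loops
        run(0, 16, n);
        System.out.printf("adjacent slots: %d ms%n", run(0, 1, n) / 1_000_000);
        System.out.printf("spaced slots:   %d ms%n", run(0, 16, n) / 1_000_000);
    }
}
```

On a multi-core machine the adjacent case would be expected to run slower than the spaced case, but the size of the gap depends on the CPU and the JVM, so treat the numbers as indicative rather than precise.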

So the key is to absolutely minimize ANY shared modified memory structures. Make everything as read-only as possible. This INCLUDES shared FIFO buffers (including Executor pools). Namely, if you use a synchronized queue, then every sync op touches a shared dirty memory region. Moreover, if the rate is high enough, it'll likely trigger an OS trap to stall, waiting on the peer thread's mutexes.

The ideal is to segment RAM, distribute a single large unit of work to each of a fixed number of workers, then use a CountDownLatch or some other memory barrier (such that each thread only touches it once). Ideally any temporary buffers are pre-allocated instead of going into and out of a shared memory pool (which then causes cache contention). Java 'synchronized' blocks leverage (behind the scenes) a shared hash-table memory space and thus trigger the undesirable dirty reads. I haven't determined whether Java 5 Lock objects avoid this, but you're still leveraging OS stalls, which won't help your throughput. Obviously most OutputStream operations trigger such synchronized calls (and of course are typically filling a common stream buffer).
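The layout described above might be sketched as follows: the input array is only read, each worker gets one contiguous segment, accumulates into a local variable, writes its result once, and touches the latch once. The FP workload (a sum of square roots) and the names `SegmentedSum` and `sumOfSquareRoots` are assumptions chosen for illustration.

```java
import java.util.concurrent.CountDownLatch;

class SegmentedSum {
    /** Splits data into one contiguous segment per worker and sums sqrt of each element. */
    static double sumOfSquareRoots(double[] data, int workers) {
        double[] partial = new double[workers];          // each slot is written exactly once
        CountDownLatch done = new CountDownLatch(workers);
        int chunk = (data.length + workers - 1) / workers; // ceiling division
        for (int w = 0; w < workers; w++) {
            final int id = w;
            final int from = Math.min(data.length, w * chunk);
            final int to = Math.min(data.length, from + chunk);
            new Thread(() -> {
                double local = 0;                        // thread-private accumulator
                for (int i = from; i < to; i++) {
                    local += Math.sqrt(data[i]);         // shared data is read-only
                }
                partial[id] = local;                     // single shared write per worker
                done.countDown();                        // single latch touch per worker
            }).start();
        }
        try {
            done.await(); // countDown() happens-before await() returns, so partial[] is visible
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        double total = 0;
        for (double p : partial) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] data = new double[8_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        for (int workers : new int[] {1, 2, 4}) {
            long start = System.nanoTime();
            double sum = sumOfSquareRoots(data, workers);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(workers + " worker(s): sum=" + sum + " in " + ms + " ms");
        }
    }
}
```

Running `main` with different worker counts gives a crude single-threaded vs. multithreaded comparison for an FP-heavy loop. A serious measurement would need warm-up runs and repeated trials.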

Generally, my experience is that single-threading is faster than multithreading when working on a common byte array, object array, etc., at least with the simplistic sorting/filtering algorithms that I've experimented with. This is true both in Java and C in my experience. I haven't tried FPU-intensive ops (like divides or sqrt), where cache lines may be less of a factor.

Basically, if you're on a single CPU you don't have cache-line problems (unless the OS always flushes the cache, even between threads of the same process), but multithreading buys you less than nothing. With hyperthreading, it's the same deal. In single-CPU configurations with a shared L2/L3 cache (e.g. AMDs), you might find some benefit. On multi-CPU Intel buses, forget it: shared write memory is worse than single-threading.

M. Maraist answered Oct 22 '22 02:10