Let's assume each thread is doing some FP calculation, I am interested in <ul> <li>how much time the CPU is used in switching threads instead of running them</li> <li>how much synchronization traffic is created on shared memory bus - when threads share data, they must use synchronization mechanism</li> </ul> My question: how to design a test program to get this data?

You can't easily differentiate the waste due to thread-switching and that due to memory cache contention. You CAN measure the thread contention.. Namely, on linux, you can cat /proc/PID/XXX and get tons of detailed per-thread statistics. HOWEVER, since the pre-emptive scheduler is not going to shoot itself in the foot, you're not going to get more than say 30 ctx switches per second no matter how many threads you use.. And that time is going to be relatively small v.s. the amount of work you're doing.. The real cost of context-switching is the cache pollution. e.g. there is a high probability that you'll have mostly cache misses once you're context-switched back in. Thus OS time and context-switch-counts are of minimal value. What's REALLY valuable is the ratio of inter-thread cache-line dirties. Depending on the CPU, a cache-line dirty followed by a peer-CPU read is SLOWER than a cache-miss - because you have to force the peer CPU to write it's value to main-mem before you can even start reading.. Some CPUs let you pull from peer cache-lines without hitting main-mem. So the key is the absolutely minimize ANY shared modified memory structures.. Make everything as read-only as possible.. This INCLUDES share FIFO buffers (including Executor pools).. Namely if you used a synchronized queue - then every sync-op is a shared dirty memory region. And more-over, if the rate is high enough, it'll likely trigger an OS trap to stall, waiting for peer thread's mutex's. The ideal is to segment RAM, distribute to a fixed number of workers a single large unit of work, then use a count-down-latch or some other memory barrier (such that each thread would only touch it once). Ideally any temporary buffers are pre-allocated instead of going into and out of a shared memory pool (which then causes cache contention). Java 'synchronized' blocks leverage (behind the scenes) a shared hash-table memory space and thus trigger the undesirable dirty-reads, I haven't determined if java 5 Lock objects avoid this, but you're still leveraging OS stalls which won't help in your throughput. Obviously most OutputStream operations trigger such synchronized calls (and of course are typically filling a common stream buffer). Generally my experience is that single-threading is faster than mulithreading for a common byte-array/object-array, etc. At least with simplistic sorting/filtering algorithms that I've experimented with. This is true both in Java and C in my experience. I haven't tried FPU intesive ops (like divides, sqrt), where cache-lines may be less of a factor. Basically if you're a single CPU you don't have cache-line problems (unless the OS is always flushing the cache even in shared threads), but multithreading buys you less than nothing. In hyperthreading, it's the same deal. In single-CPU shared L2/L3 cache configurations (e.g. AMDs), you might find some benefit. In multi CPU Intel BUS's, forget it - shared write-memory is worse than single-threading.

how can you measure the time spent in a context switch under java platform

1 Answers

You can't easily differentiate the waste due to thread-switching and that due to memory cache contention. You CAN measure the thread contention.. Namely, on linux, you can cat /proc/PID/XXX and get tons of detailed per-thread statistics. HOWEVER, since the pre-emptive scheduler is not going to shoot itself in the foot, you're not going to get more than say 30 ctx switches per second no matter how many threads you use.. And that time is going to be relatively small v.s. the amount of work you're doing.. The real cost of context-switching is the cache pollution. e.g. there is a high probability that you'll have mostly cache misses once you're context-switched back in. Thus OS time and context-switch-counts are of minimal value.

What's REALLY valuable is the ratio of inter-thread cache-line dirties. Depending on the CPU, a cache-line dirty followed by a peer-CPU read is SLOWER than a cache-miss - because you have to force the peer CPU to write it's value to main-mem before you can even start reading.. Some CPUs let you pull from peer cache-lines without hitting main-mem.

So the key is the absolutely minimize ANY shared modified memory structures.. Make everything as read-only as possible.. This INCLUDES share FIFO buffers (including Executor pools).. Namely if you used a synchronized queue - then every sync-op is a shared dirty memory region. And more-over, if the rate is high enough, it'll likely trigger an OS trap to stall, waiting for peer thread's mutex's.

The ideal is to segment RAM, distribute to a fixed number of workers a single large unit of work, then use a count-down-latch or some other memory barrier (such that each thread would only touch it once). Ideally any temporary buffers are pre-allocated instead of going into and out of a shared memory pool (which then causes cache contention). Java 'synchronized' blocks leverage (behind the scenes) a shared hash-table memory space and thus trigger the undesirable dirty-reads, I haven't determined if java 5 Lock objects avoid this, but you're still leveraging OS stalls which won't help in your throughput. Obviously most OutputStream operations trigger such synchronized calls (and of course are typically filling a common stream buffer).

Generally my experience is that single-threading is faster than mulithreading for a common byte-array/object-array, etc. At least with simplistic sorting/filtering algorithms that I've experimented with. This is true both in Java and C in my experience. I haven't tried FPU intesive ops (like divides, sqrt), where cache-lines may be less of a factor.

Basically if you're a single CPU you don't have cache-line problems (unless the OS is always flushing the cache even in shared threads), but multithreading buys you less than nothing. In hyperthreading, it's the same deal. In single-CPU shared L2/L3 cache configurations (e.g. AMDs), you might find some benefit. In multi CPU Intel BUS's, forget it - shared write-memory is worse than single-threading.

127

answered Oct 22 '22 02:10

M. Maraist

Related questions
                            
                                Mongo Schema-less Collections & C#
                            
                                Android MediaPlayer - Sometimes No Video is Played Even Though Audio Plays
                            
                                Including an image using roxygen documentation
                            
                                How to log stdout output in Tomcat?
                            
                                Database Design for storing Chat Messages between people
                            
                                How to get input from Chrome's Javascript console?
                            
                                std::size_t vs size_t vs std::string::size_type
                            
                                Is the C4345 warning of Visual Studio wrong?
                            
                                Tool for automatically check docstring style according to PEP257 [closed]
                            
                                How can I measure image noise
                            
                                Why @OneToOne is allowing duplicate associations?
                            
                                TFS 2010 - Error occurred during copy: Path too long

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

how can you measure the time spent in a context switch under java platform

Tags:

SecureFish

People also ask

1 Answers

M. Maraist

Recent Activity

Donate For Us