I'd really appreciate it if someone with good experience of Intel VTune Amplifier could tell me about this.
I recently received a performance analysis report from other people who ran Intel VTune Amplifier against my program. It says there is high overhead time in the thread concurrency area.
What is the meaning of Overhead Time? They don't know (they asked me), and I don't have access to Intel VTune Amplifier myself.
I have a vague idea. This program has many thread sleep calls, because pthread condition variables were unstable (or I used them badly) on the target platform, so I changed many routines to do their work in a polling loop like the one below:
while (true)
{
    mutex.lock();
    if (event changed)
    {
        mutex.unlock();
        // do something
        break;
    }
    else
    {
        mutex.unlock();
        usleep(3 * 1000);
    }
}
Can this be flagged as Overhead Time? Any advice?
I found the help documentation about Overhead Time on the Intel site: http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/win/ug_docs/olh/common/overhead_time.html#overhead_time
Excerpt:
Overhead time is a duration that starts with the release of a shared resource and ends with the receipt of that resource. Ideally, the duration of Overhead time is very short because it reduces the time a thread has to wait to acquire a resource. However, not all CPU time in a parallel application may be spent on doing real pay load work. In cases when parallel runtime (Intel® Threading Building Blocks, OpenMP*) is used inefficiently, a significant portion of time may be spent inside the parallel runtime wasting CPU time at high concurrency levels. For example, this may result from low granularity of work split in recursive parallel algorithms: when the workload size becomes too low, the overhead on splitting the work and performing the housekeeping work becomes significant.
Still confusing... Could it mean "you locked unnecessarily or too frequently"?
I am not much of an expert on this either, though I have used pthread a bit myself.
To demonstrate my understanding of overhead time, let us take the example of a simple single-threaded program to compute an array sum:
for (i = 0; i < NUM; i++) {
    sum += array[i];
}
In a simple [reasonably done] multi-threaded version of that code, the array could be broken into one piece per thread, each thread keeps its own sum, and after the threads are done, the partial sums are added together.
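A minimal sketch of that "reasonable" version (using `std::thread` for brevity; the function name and split scheme are mine):

```cpp
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

// Sum `array` by giving each thread its own slice and its own partial
// sum; only the final combine touches shared data, so no locking or
// atomics are needed inside the hot loop.
long parallel_sum(const std::vector<int>& array, int num_threads)
{
    std::vector<long> partial(num_threads, 0);
    std::vector<std::thread> threads;
    size_t chunk = (array.size() + num_threads - 1) / num_threads;

    for (int t = 0; t < num_threads; t++) {
        threads.emplace_back([&, t] {
            size_t begin = t * chunk;
            size_t end = std::min(array.size(), begin + chunk);
            for (size_t i = begin; i < end; i++)
                partial[t] += array[i];   // private accumulator
        });
    }
    for (auto& th : threads)
        th.join();

    // Combine the per-thread sums once, after all threads are done.
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```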
In a very poorly written multi-threaded version, the array could be broken down as before, but every thread would do an atomic add to a single global sum.
In this case, the atomic addition can only be performed by one thread at a time. I believe that overhead time is a measure of how long all of the other threads spend waiting to do their own atomic add (you could write this program to check, if you want to be sure).
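The poorly written variant might look like this (same made-up split as above, but with a single shared `std::atomic` that every thread hammers):

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Every addition goes through one shared atomic counter, so the
// threads serialize on it and spend much of their time waiting for
// each other's atomic adds rather than doing useful summing.
long contended_sum(const std::vector<int>& array, int num_threads)
{
    std::atomic<long> sum{0};
    std::vector<std::thread> threads;
    size_t chunk = (array.size() + num_threads - 1) / num_threads;

    for (int t = 0; t < num_threads; t++) {
        threads.emplace_back([&, t] {
            size_t begin = t * chunk;
            size_t end = std::min(array.size(), begin + chunk);
            for (size_t i = begin; i < end; i++)
                sum.fetch_add(array[i]);   // contended shared update
        });
    }
    for (auto& th : threads)
        th.join();
    return sum.load();
}
```

The result is still correct; it is only the time the threads spend queued up on the shared counter that a profiler would report as overhead.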
Of course, it also accounts for the time it takes to switch the semaphores and mutexes around. In your case, it probably means a significant amount of time is spent in the internals of mutex.lock and mutex.unlock.
I parallelized a piece of software a while ago (using pthread_barrier) and ran into cases where it took longer to run the barriers than it did to just use one thread. It turned out that the loop that needed 4 barriers executed quickly enough that the barrier overhead was not worth it.