Why does Intel Threading Building Blocks (TBB) parallel_for
have such a large overhead? According to section 3.2.2 Automatic Chunking in the Tutorial.pdf
it's around half a millisecond. This is an excerpt from the tutorial:
CAUTION: Typically a loop needs to take at least a million clock cycles for parallel_for to improve its performance. For example, a loop that takes at least 500 microseconds on a 2 GHz processor might benefit from parallel_for.
From what I have read so far TBB uses the threadpool (pool of worker threads) pattern internally and it prevents such bad overheads by only spawning worker threads once initially (which costs hundreds of microseconds).
So what is taking the time? Data synchronization using mutexes isn't that slow, right? Besides, doesn't TBB make use of lock-free data structures for synchronization?
From what I have read so far TBB uses the threadpool (pool of worker threads) pattern internally and it prevents such bad overheads by only spawning worker threads once initially (which costs hundreds of microseconds).
Yes, TBB pre-allocates threads. It doesn't physically create and join worker threads every time it sees parallel_for. OpenMP and other parallel libraries also pre-allocate their thread pools.
But there is still overhead to wake threads up from the pool and to dispatch logical tasks to them. Yes, TBB exploits lock-free data structures to minimize overhead, but some amount of parallel overhead (i.e., a serial part) remains. That's why the TBB manual advises avoiding very short loops.
In general, you must have enough work to gain a parallel speedup. I think even 1 millisecond (= 1,000 microseconds) is too small. In my experience, to see a meaningful speedup, I needed to increase execution time to around 100 milliseconds.
If the parallel overhead of TBB's parallel_for is really a concern to you, it might be worth trying simple static scheduling. I don't know TBB's static-scheduling implementation well, but you can easily try OpenMP's: omp parallel for schedule(static). I believe this overhead is close to the minimal cost of a parallel for. However, since it uses static scheduling, the benefit of dynamic scheduling (especially when workloads are not homogeneous) is lost.