I have a serial(nonparallel) application written in C. I have modified and re-written it using Intel Threading Building Blocks. When I run this parallel version on an AMD Phenom II machine which is a quad-core machine, I get a performance gain of more than 4X which conflicts with the Amdahl's law. Can anyone give me a reason why this is happening?
Thanks, Rakesh.
If you rewrite the program, you could make it more efficient. Amdahl's law only limits the amount of speedup due to parallelism, not how much faster you can make your code by improving it.
You could be realizing the effects of having 4x the cache, since now you can use all four procs. Or maybe having less contention with other processes running on your machine. Or you accidentally fixed a mispredicted branch.
TL/DR: it happens.
It's known as "super linear speedup", and can occur for a variety of reasons, though the most common root cause is probably cache behaviour. Usually when superlinear speedup occurs, it's a clue that you could make the sequential version more efficient.
For example, suppose you have a processor where some of the cores share an L2 cache (a common architecture these days), and suppose your algorithm makes multiple traversals of a large data structure. If you perform the traversals in sequence, then each traversal will have to pull the data into the L2 cache afresh, whereas if you perform the traversals in parallel then you may well avoid a large number of those misses, as long as the traversals run in step (getting out of step is a good source of unpredictable performance here). To make the sequential verison more efficient you could interleave the traversals, thereby improving locality.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With