I have a unit of work on a .NET 4.51, C# Web Service that takes 100 milliseconds. Usually the web request contains 10 units of work or so. Thus processing it sequentially via a for loop takes about a second.
foreach (var u in unitsOfWork) {
Run(u);
}
Since the box has 12 CPUs, I decided to split up the work and run it in parallel hoping to get a performance gain. I used Parallel.ForEach to do the work:
Parallel.ForEach(unitsOfWork,u => {
Run(u);
});
To my surprise, each unit of work took on average 425 milliseconds. So in the end I saved about 500 milliseconds off the request. It seems like I should be able to get better performance seeing how the box has 12 CPUs... Am I missing something simple?
I looked for anything that is shared (that could be holding it up), but found nothing...so I tried to experiment. I sent a request with 2 units of work and each took about 125 ms. With 3 requests each unit took 150 ms and so on. With each subsequent number of units, there was a penalty of around 25 to 30 ms.
So either I am doing something wrong... or there is just inherent overhead to multi-threading (didn't realize it was this big).
P.S. I also tried replacing Parallel.For with Thread.Join - same results.
The theoretical speedup that you could achieve is governed by Amdahl's law:

where T(1) is the single-threaded speed, n is the number of CPUs, and B is the percentage of the task that cannot be serialized. The overhead of starting up a new task is considered to be zero by this formula.
If your task were perfectly parallelizable, B would be zero, and you would complete the task in 1/12-th of the time. However, even a modest B of, say, 20%, would limit the highest potential speedup with 12 CPUs to only 3.75 times - a little over a third of the theoretical limit of 12 times.
Things that cannot be parallelized include serialized access to shared resources, such as I/O and waiting on completion of other tasks.
Dealing with cache contentions makes matters even worse: when concurrent tasks are accessing different regions of memory, they kick the data of each other out of hardware cache, which amounts to increasing B in the formula above.
To summarize, your observation is not uncommon, and you are not missing anything. Achieving a theoretically possible sppedup is very hard, and the actual speedup that you achieve depends on the tasks that your parallel program needs to run.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With