I've encountered a strange behavior in a .NET application that performs some highly parallel processing on a set of in-memory data.
When run on a multi-core processor (an Intel Core 2 Quad Q6600 @ 2.4 GHz) it exhibits non-linear scaling as multiple threads are kicked off to process the data.
When run as a non-multithreaded loop on a single core, the process completes approximately 2.4 million computations per second. Running four threads, you would expect roughly four times the throughput - somewhere in the neighborhood of 9 to 10 million computations per second - but alas, no. In practice it only completes about 4.1 million per second ... quite a bit short of the expected throughput.
Furthermore, the behavior occurs whether I use PLINQ, a thread pool, or four explicitly created threads. Quite strange...
Nothing else on the machine is using CPU time, nor are there any locks or other synchronization objects involved in the computation ... it should just tear ahead through the data. I've confirmed this (to the extent possible) by looking at perfmon data while the process runs ... there is no reported thread contention or garbage collection activity.
My theories at the moment:
Below is a representative excerpt from the code that should exhibit the same behavior:
var evaluator = new LookupBasedEvaluator();
// find all ten-vertex polygons that are a subset of the set of points
var ssg = new SubsetGenerator<PolygonData>(Points.All, 10);
const int TEST_SIZE = 10000000; // evaluate the first 10 million records
// materialize the data into memory...
var polygons = ssg.AsParallel()
.Take(TEST_SIZE)
.Cast<PolygonData>()
.ToArray();
var sw1 = Stopwatch.StartNew();
// single-threaded foreach loop completes in about 4.02 seconds ... ~ 2.483 million/sec
foreach( var polygon in polygons )
evaluator.Evaluate(polygon);
sw1.Stop();
Console.WriteLine( "Linear, single core loop: {0}", sw1.ElapsedMilliseconds );
// now attempt the same thing in parallel using Parallel.ForEach...
// MS documentation indicates this internally uses a worker thread pool
// completes in 2.61 seconds ... or ~ 3.831 million/sec
var sw2 = Stopwatch.StartNew();
Parallel.ForEach(polygons, p => evaluator.Evaluate(p));
sw2.Stop();
Console.WriteLine( "Parallel.ForEach() loop: {0}", sw2.ElapsedMilliseconds );
// now using PLINQ, we get slightly better results, but not by much
// completes in 2.21 seconds ... or ~ 4.524 million/second
var sw3 = Stopwatch.StartNew();
polygons.AsParallel()
.WithDegreeOfParallelism(Environment.ProcessorCount)
.AsUnordered() // not sure this is necessary...
.ForAll( p => evaluator.Evaluate(p) );
sw3.Stop();
Console.WriteLine( "PLINQ.AsParallel.ForAll: {0}", sw3.ElapsedMilliseconds );
// now using four explicit threads:
// best, still short of expectations at 1.99 seconds = ~ 5 million/sec
ParameterizedThreadStart tsd = delegate(object pset) { foreach (var p in (IEnumerable<PolygonData>) pset) evaluator.Evaluate(p); };
var t1 = new Thread(tsd);
var t2 = new Thread(tsd);
var t3 = new Thread(tsd);
var t4 = new Thread(tsd);
var sw4 = Stopwatch.StartNew();
t1.Start(polygons);
t2.Start(polygons);
t3.Start(polygons);
t4.Start(polygons);
t1.Join();
t2.Join();
t3.Join();
t4.Join();
sw4.Stop();
Console.WriteLine( "Four Explicit Threads: {0}", sw4.ElapsedMilliseconds );
Take a look at this article: http://blogs.msdn.com/pfxteam/archive/2008/08/12/8849984.aspx
Specifically, limit memory allocations in the parallel region, and carefully inspect writes to make sure that they don't occur close to memory locations that other threads read or write.
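To illustrate the second point, here is a minimal sketch (the Score() call, the totals array, and the strided loop are hypothetical, not taken from the question's code): if each worker repeatedly wrote its running total into a shared array slot, neighbouring longs would sit on the same cache line, and every write would invalidate the other cores' copies (false sharing). Accumulating into a local variable and writing the slot once at the end avoids that.
long[] totals = new long[Environment.ProcessorCount];  // adjacent longs share a cache line
Parallel.For(0, totals.Length, workerId =>
{
    long localTotal = 0;                          // thread-private accumulator, nothing shared
    for (int i = workerId; i < polygons.Length; i += totals.Length)
        localTotal += Score(polygons[i]);         // Score() is a hypothetical stand-in for the real work
    totals[workerId] = localTotal;                // a single write per thread at the very end
});
// Writing totals[workerId] += ... inside the loop instead would keep several cores
// fighting over the same cache line even though there is no logical sharing.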
So I finally figured out what the problem was - and I think it would be useful to share it with the SO community.
The entire issue with non-linear performance was the result of a single line inside the Evaluate() method:
var coordMatrix = new long[100];
Since Evaluate() is invoked millions of times, this memory allocation was occurring millions of times. As it happens, the CLR internally performs some inter-thread synchronization when allocating memory - otherwise allocations on multiple threads could inadvertently overlap. Changing the array from a method-local instance to a class instance that is allocated only once (but then initialized in a method-local loop) eliminated the scalability problem.
Normally, it's an antipattern to create a class-level member for a variable that is only used (and meaningful) within the scope of a single method. But in this case, since I need the greatest possible scalability, I will live with (and document) this optimization.
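For illustration, a minimal sketch of the change (the surrounding members of LookupBasedEvaluator are assumed, not copied from the real class); note that with an instance field the method is no longer reentrant, so each worker thread needs its own evaluator instance:
// Before: a fresh allocation on every call, millions of times per second,
// each one going through the CLR's synchronized allocation path.
public void Evaluate(PolygonData polygon)
{
    var coordMatrix = new long[100];
    // ... fill and use coordMatrix ...
}

// After: the buffer is allocated once per evaluator instance and merely
// re-initialized by a method-local loop on each call.
private readonly long[] coordMatrix = new long[100];

public void Evaluate(PolygonData polygon)
{
    for (int i = 0; i < coordMatrix.Length; i++)
        coordMatrix[i] = 0;
    // ... fill and use coordMatrix ...
}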
Epilogue: After I made this change, the concurrent process was able to achieve 12.2 million computations / sec.
P.S. Kudos to Igor Ostrovsky for his germane link to MSDN blogs which helped me identify and diagnose the problem.
Non-linear scaling is to be expected when comparing a parallel algorithm with a sequential one, since there is some inherent overhead in the parallelization. (Ideally, of course, you want to get as close to linear as you can.)
Additionally, there will usually be certain things you need to take care of in a parallel algorithm that you don't need to in a sequential algorithm. Beyond synchronization (which can really bog down your work), there are some other things that can happen:
As far as I can tell, your current explicit approach uses a shared iterator between the threads. That's an okay solution if the processing time varies wildly across the array, but there is likely to be synchronization overhead to prevent an element from being skipped (retrieving the current element and advancing the internal pointer to the next one has to be an atomic operation).
Therefore, it might be a better idea to partition the array, assuming the processing time of each element is expected to be roughly equal regardless of the position of the element. Given that you have 10 million records, that means telling thread 1 to work on elements 0 to 2,499,999, thread 2 works on elements 2,500,000 to 4,999,999, etc. You can assign each thread an ID and use this to calculate the actual range.
Another small improvement would be to let the main thread act as one of the threads that calculates. However, if I remember correctly, that's a very minor thing.
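A rough sketch of that partitioning scheme (the chunk arithmetic and the choice to let the main thread take the last chunk are illustrative, not taken from the question's code):
int workerCount = Environment.ProcessorCount;       // 4 on the Q6600
int chunkSize = polygons.Length / workerCount;
var workers = new List<Thread>();

for (int id = 0; id < workerCount - 1; id++)        // the main thread will take the last chunk
{
    int start = id * chunkSize;
    int end = start + chunkSize;                    // exclusive upper bound
    var t = new Thread(() =>
    {
        for (int i = start; i < end; i++)           // each thread owns a disjoint range
            evaluator.Evaluate(polygons[i]);
    });
    workers.Add(t);
    t.Start();
}

// the main thread handles the final chunk, including any remainder elements
for (int i = (workerCount - 1) * chunkSize; i < polygons.Length; i++)
    evaluator.Evaluate(polygons[i]);

foreach (var t in workers)
    t.Join();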