I have been using PLINQ recently to perform some data handling.
Basically I have about 4000 time series (that is, instances of Dictionary<DateTime,T>), which I store in a list called timeSeries.
To perform my operation, I simply do:
timeSeries.AsParallel().ForAll(x=>myOperation(x))
If I look at what is happening on my different cores, I notice that at first all my CPUs are being used, and I can see on the console (where I output some logs) that several time series are processed at the same time.
However, the process is lengthy, and after about 45 minutes, the logging clearly indicates that there is only one thread working. Why is that?
I gave it some thought and realized that timeSeries contains instances that are simpler to process, from myOperation's point of view, at the beginning and the end of the list. So I wondered whether the algorithm PLINQ uses consists of splitting the 4000 instances across, say, 4 cores, giving each of them 1000. Then, when a core has finished its allocation of work, it goes idle. This would mean that one of the cores may be facing a much heavier workload.
Is my theory correct or is there another possible explanation?
Should I shuffle my list before running it, or is there some kind of parallelism parameter I can use to fix this problem?
Your theory is probably correct, although there is something called "work stealing" that should counter this. I'm not sure why that doesn't work here. Are there many (dozens or more) large jobs at the outer ends, or just a few?
Aside from shuffling your data, you could use the overload of AsParallel() that accepts a custom Partitioner. That would allow you to balance the work better.
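A minimal sketch of the partitioner approach, assuming the built-in load-balancing partitioner from Partitioner.Create(list, loadBalance: true) is enough for your case (you may not need to write a custom one). With loadBalance set to true, workers pull items on demand rather than each getting one fixed range up front, which as far as I know is what PLINQ does by default for a List. The class name, the MyOperation stand-in, the BuildSample helper and the double value type are all made up for illustration; substitute your own.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class LoadBalancedPlinqSketch
{
    // Hypothetical stand-in for myOperation: cost grows with the series length.
    static void MyOperation(Dictionary<DateTime, double> series)
    {
        Thread.Sleep(series.Count);
    }

    public static int ProcessAll(List<Dictionary<DateTime, double>> timeSeries)
    {
        int processed = 0;

        // loadBalance: true requests on-demand partitioning, so a worker
        // that finishes early keeps pulling items instead of going idle.
        var partitioner = Partitioner.Create(timeSeries, loadBalance: true);

        partitioner.AsParallel().ForAll(series =>
        {
            MyOperation(series);
            Interlocked.Increment(ref processed);
        });

        return processed;
    }

    // Hypothetical sample data: series of varying length (1 to 5 points).
    public static List<Dictionary<DateTime, double>> BuildSample(int count)
    {
        var start = new DateTime(2020, 1, 1);
        return Enumerable.Range(0, count)
            .Select(i => Enumerable.Range(0, i % 5 + 1)
                .ToDictionary(d => start.AddDays(d), d => (double)d))
            .ToList();
    }

    public static void Main()
    {
        var timeSeries = BuildSample(20);
        Console.WriteLine($"Processed {ProcessAll(timeSeries)} series");
    }
}
```

Whether this actually removes the single-thread tail depends on how uneven the individual jobs are; if one series alone takes most of the time, no partitioning scheme will help.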
Side note: for this situation I would prefer Parallel.ForEach(): more options and cleaner syntax.
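For comparison, here is roughly what the Parallel.ForEach() version could look like. This is only a sketch: the class name, MyOperation, BuildSample and the double value type are placeholders, and the MaxDegreeOfParallelism value is just one example of the extra options available through ParallelOptions.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelForEachSketch
{
    // Hypothetical stand-in for myOperation: cost grows with the series length.
    static void MyOperation(Dictionary<DateTime, double> series)
    {
        Thread.Sleep(series.Count);
    }

    public static int ProcessAll(List<Dictionary<DateTime, double>> timeSeries)
    {
        int processed = 0;

        // Example option only: cap the number of concurrent workers.
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        // The same load-balancing partitioner can be handed to Parallel.ForEach.
        Parallel.ForEach(
            Partitioner.Create(timeSeries, loadBalance: true),
            options,
            series =>
            {
                MyOperation(series);
                Interlocked.Increment(ref processed);
            });

        return processed;
    }

    // Hypothetical sample data: series of varying length (1 to 5 points).
    public static List<Dictionary<DateTime, double>> BuildSample(int count)
    {
        var start = new DateTime(2020, 1, 1);
        return Enumerable.Range(0, count)
            .Select(i => Enumerable.Range(0, i % 5 + 1)
                .ToDictionary(d => start.AddDays(d), d => (double)d))
            .ToList();
    }

    public static void Main()
    {
        var timeSeries = BuildSample(20);
        Console.WriteLine($"Processed {ProcessAll(timeSeries)} series");
    }
}
```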