
PLINQ vs Tasks vs Async vs Producer/Consumer queue? What to use?

I was reading C# 5.0 in a Nutshell and, after reading the author's views, I am quite confused as to which approach I should adopt. My requirement is: say I have a really long-running (computationally heavy) task, for example calculating the SHA1 (or some other) hash of millions of files, or really anything else that is computationally heavy and likely to take some time. What should my approach be toward developing it (in WinForms, if that matters, using VS 2012 and C# 5.0) so that I can also report progress to the user?

The following scenarios come to mind...

  1. Create a Task (with the LongRunning option) that computes the hashes and reports progress to the user, either by implementing IProgress<T>/Progress<T> or by letting the task capture the SynchronizationContext and posting to the UI (a progress-reporting sketch follows this list).

  2. Create an async method like

     async Task CalculateHashesAsync()
     {
         // await a task that calculates the hashes on the thread pool
         await Task.Run(() => CalculateHash());
         // how do I report progress? (see the sketch after this list)
     }
    
  3. Use TPL (or PLINQ) as

    void CalculateHashes()
    {
        Parallel.For(0, allFiles.Count, i => CalcHash(allFiles[i]));
        // how do I report progress here? (see the sketch after this list)
    }
    
  4. Use a producer/consumer queue.
     I don't really know how to do this.
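
For the progress question in options 1-3, here is a minimal sketch of one way to wire it up with Progress<T> (assuming a WinForms form with a ProgressBar named progressBar1, a List<string> named allFiles, and a ComputeHash method; those names are illustrative, not from the book):

    // Minimal sketch (needs System.Threading and System.Threading.Tasks).
    // Progress<T> is created on the UI thread, so its callback is posted back
    // to the captured SynchronizationContext; the hashing runs on the pool.
    private async void startButton_Click(object sender, EventArgs e)
    {
        var progress = new Progress<int>(done =>
        {
            // Runs on the UI thread.
            progressBar1.Value = done * 100 / allFiles.Count;
        });

        await Task.Run(() => CalculateHashes(allFiles, progress));
    }

    private void CalculateHashes(IList<string> files, IProgress<int> progress)
    {
        int done = 0;
        Parallel.ForEach(files, file =>
        {
            ComputeHash(file);                                // CPU-bound work
            progress.Report(Interlocked.Increment(ref done)); // thread-safe count
        });
    }

The same IProgress<int> pattern would answer the "how do I report progress" comments in options 2 and 3 as well.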

The author in the book says...

Running one long-running task on a pooled thread won't cause trouble. It's when you run multiple long-running tasks in parallel (particularly ones that block) that performance can suffer. In that case, there are usually better solutions than TaskCreationOptions.LongRunning:

  • If tasks are I/O-bound, TaskCompletionSource and asynchronous functions let you implement concurrency with callbacks instead of threads.
  • If tasks are compute-bound, a producer/consumer queue lets you throttle the concurrency for those tasks, avoiding starvation for other threads and processes.

About the producer/consumer queue, the author says...

A producer/consumer queue is a useful structure, both in parallel programming and general concurrency scenarios as it gives you precise control over how many worker threads execute at once, which is useful not only in limiting CPU consumption, but other resources as well.

So, should I not use a Task, meaning that the first option is out? Is the second one the best option? Are there any other options? And if I were to follow the author's advice and implement a producer/consumer queue, how would I do that? (I don't even have an idea of how to get started with producer/consumer in my scenario, if that is the best approach!)
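If you do go the producer/consumer route, a minimal sketch using BlockingCollection<T> could look like the following (the worker count, bounded capacity, and ComputeHash are illustrative assumptions, not taken from the book):

    // Producer/consumer sketch (needs System.Collections.Concurrent,
    // System.Linq and System.Threading.Tasks). One producer enqueues file
    // paths; a fixed number of consumer tasks dequeue and hash them.
    void HashWithProducerConsumer(IEnumerable<string> allFiles, int workerCount)
    {
        using (var queue = new BlockingCollection<string>(boundedCapacity: 100))
        {
            // Consumers: a fixed number of workers caps CPU usage.
            Task[] workers = Enumerable.Range(0, workerCount)
                .Select(_ => Task.Run(() =>
                {
                    foreach (var file in queue.GetConsumingEnumerable())
                        ComputeHash(file);
                }))
                .ToArray();

            // Producer: Add blocks when the queue is full, which throttles memory.
            foreach (var file in allFiles)
                queue.Add(file);

            queue.CompleteAdding();   // signal "no more items"
            Task.WaitAll(workers);
        }
    }

The fixed number of worker tasks is what gives the precise control over concurrency that the author describes, and the bounded capacity keeps the producer from outrunning the consumers.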

I'd like to know whether someone has come across such a scenario and how they implemented it. If not, what would be the most performant and/or easiest approach to develop and maintain? (I know the word performance is subjective, but let's just consider the very general case that it works, and works well!)

Razort4x asked Jul 04 '13


1 Answer

really long-running (computationally heavy) task, say for example, calculating the SHA1 (or some other) hash of millions of files

That example clearly has both heavy CPU (hashing) and I/O (file) components. Perhaps this is a non-representative example, but in my experience even a secure hash is far faster than reading the data from disk.

If you just have CPU-bound work, the best solution is either Parallel or PLINQ. If you just have I/O-bound work, the best solution is to use async. If you have a more realistic and complex scenario (with both CPU and I/O work), then you should either hook up your CPU and I/O parts with producer/consumer queues or use a more complete solution such as TPL Dataflow.
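For the purely CPU-bound case, a PLINQ version might look like this (allFiles and ComputeHash are placeholder names):

    // PLINQ sketch: hash every file, capping the degree of parallelism.
    var hashes = allFiles
        .AsParallel()
        .WithDegreeOfParallelism(Environment.ProcessorCount)
        .Select(file => ComputeHash(file))
        .ToList();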

TPL Dataflow works well with both parallel (MaxDegreeOfParallelism) and async, and has a built-in producer/consumer queue between each block.

One thing to keep in mind when mixing massive amounts of I/O and CPU usage is that different situations can cause massively different performance characteristics. To be safe, you'll want to throttle the data going through your queues so you won't end up with memory usage issues. TPL Dataflow has built-in support for throttling via BoundedCapacity.
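As a rough sketch of such a pipeline (assuming the System.Threading.Tasks.Dataflow NuGet package, an async method context, and a placeholder ReadFileAsync helper for the I/O stage; none of these names come from the answer):

    // Two-stage TPL Dataflow pipeline: an I/O-bound read block feeding a
    // CPU-bound hash block, with bounded buffers between them.
    var readBlock = new TransformBlock<string, byte[]>(
        async path => await ReadFileAsync(path),   // placeholder async file read
        new ExecutionDataflowBlockOptions { BoundedCapacity = 10 });

    var hashBlock = new ActionBlock<byte[]>(
        bytes => { using (var sha1 = SHA1.Create()) sha1.ComputeHash(bytes); },
        new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = Environment.ProcessorCount,
            BoundedCapacity = 10    // limits how many buffers queue up
        });

    readBlock.LinkTo(hashBlock, new DataflowLinkOptions { PropagateCompletion = true });

    foreach (var file in allFiles)
        await readBlock.SendAsync(file);   // waits when the read block is full

    readBlock.Complete();
    await hashBlock.Completion;

SendAsync waits whenever a block's buffer is full, so the BoundedCapacity settings are what keep memory usage flat while millions of files flow through.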

Stephen Cleary answered Sep 19 '22