I/O performance - async vs TPL vs Dataflow vs RX

I have a piece of C# 5.0 code that generates a ton of network and disk I/O. I need to run multiple copies of this code in parallel. Which of the following technologies is likely to give me the best performance?

  • async methods with await

  • using Task from the TPL directly

  • the TPL Dataflow NuGet package

  • Reactive Extensions

I'm not very good at this parallel stuff, but if using something lower-level, like say Thread, can give me a lot better performance, I'd consider that too.

Eliezer Kohen asked Apr 17 '13 01:04


3 Answers

This is like trying to optimize the length of your transatlantic flight by asking for the quickest way to remove your seatbelt.

Ok, some real advice, since I was kind of a jerk

Let's give a helpful answer. Think of performance in terms of "classes" of activities, each one at least an order of magnitude slower than the one before it:

  1. Only accessing the CPU, very little memory usage (e.g. rendering very simple graphics to a very fast GPU, or calculating digits of Pi)
  2. Only accessing CPU and in-memory things, nothing on disk (e.g. a well-written game)
  3. Accessing the disk
  4. Accessing the network.

If you do even one of activity #3, there's no point in doing optimizations typical to activities #1 and #2 like optimizing threading libraries - they're completely overshadowed by the disk hit. Same for CPU tricks - if you're constantly incurring L2/L3 cache misses, sparing a few CPU cycles by hand-writing assembly isn't worth it (which is why things like loop unrolling are usually a bad idea these days).

So, what can we derive from this? There are two ways to make your program faster: either by moving up from #3 to #2 (which isn't often possible, depending on what you're doing), or by doing less I/O. Disk and network speed are the rate-limiting factors in most modern applications, and that's what you should be trying to optimize.

Ana Betts answered Nov 16 '22 22:11


Any performance difference between these options would be inconsequential in the face of "a ton of network and disk I/O".

A better question to ask is "which option is easiest to learn and develop with?" Or "which option would be best to maintain this code with five years from now?" And for that I would suggest async first, or Dataflow or Rx if your logic is better represented as a stream.
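As a rough illustration of the "async first" suggestion, here is a minimal sketch of running several copies of an I/O-heavy job in parallel with async/await. The names `ParallelIoSketch`, `RunJobAsync`, and `RunAllAsync` are hypothetical, and the temp-file round trip is only a stand-in for the asker's real network and disk work, which the question does not show. It sticks to C# 5.0 features (no async Main, no await in finally).

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

static class ParallelIoSketch
{
    // Hypothetical unit of work: write a 4 KB payload to a temp file and read it
    // back using asynchronous I/O, so no thread is blocked while the disk is busy.
    public static async Task<int> RunJobAsync(int jobId)
    {
        string path = Path.Combine(
            Path.GetTempPath(),
            "io-sketch-" + jobId + "-" + Guid.NewGuid().ToString("N") + ".tmp");
        byte[] payload = Encoding.UTF8.GetBytes(new string('x', 4096));

        try
        {
            using (var output = new FileStream(path, FileMode.Create, FileAccess.Write,
                                               FileShare.None, 4096, useAsync: true))
            {
                await output.WriteAsync(payload, 0, payload.Length);
            }

            int total = 0;
            byte[] buffer = new byte[1024];
            using (var input = new FileStream(path, FileMode.Open, FileAccess.Read,
                                              FileShare.Read, 4096, useAsync: true))
            {
                int read;
                while ((read = await input.ReadAsync(buffer, 0, buffer.Length)) > 0)
                {
                    total += read; // count the bytes read back
                }
            }
            return total;
        }
        finally
        {
            File.Delete(path); // synchronous cleanup; does not throw if the file is missing
        }
    }

    // Starts every job, then waits for all of them together.
    public static Task<int[]> RunAllAsync(int copies)
    {
        return Task.WhenAll(Enumerable.Range(0, copies).Select(id => RunJobAsync(id)));
    }

    static void Main()
    {
        // C# 5.0 has no async Main, so block once at the top level.
        int[] byteCounts = RunAllAsync(8).GetAwaiter().GetResult();
        Console.WriteLine("Jobs completed: " + byteCounts.Length);
    }
}
```

The point of the sketch is maintainability rather than speed: each job reads like sequential code, and `Task.WhenAll` is the only piece that deals with running the copies concurrently.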

Stephen Cleary answered Nov 16 '22 22:11


It's an older question, but for anyone reading this...

It depends. If you try to saturate a 1 Gbps link with 50 B messages, you will be CPU bound even with a simple non-blocking send over raw sockets. If, on the other hand, you are happy with 1 Mbps throughput or your messages are larger than 10 KB, any of these frameworks will do the job.
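As a back-of-the-envelope check on those figures, the message rates work out roughly as follows. This assumes 1 Gbps = 10^9 bit/s and 10 KB = 10^4 B, and ignores protocol headers; those are simplifying assumptions, not numbers from the answer itself.

```latex
\[
\frac{10^{9}\ \text{bit/s}}{50\ \text{B} \times 8\ \text{bit/B}} = 2.5 \times 10^{6}\ \text{messages/s}
\qquad \text{(1 Gbps, 50 B messages)}
\]
\[
\frac{10^{9}\ \text{bit/s}}{10^{4}\ \text{B} \times 8\ \text{bit/B}} = 1.25 \times 10^{4}\ \text{messages/s}
\qquad \text{(1 Gbps, 10 KB messages)}
\]
\[
\frac{10^{6}\ \text{bit/s}}{50\ \text{B} \times 8\ \text{bit/B}} = 2.5 \times 10^{3}\ \text{messages/s}
\qquad \text{(1 Mbps, 50 B messages)}
\]
```

So the per-message overhead of the framework only starts to matter in the first case, where it is paid millions of times per second.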

For low-bandwidth situations, I would recommend prioritizing by ease of use, i.e. async/await, Dataflow, Rx, TPL, in this order. Note that a high-bandwidth application should be prototyped as if it were low-bandwidth and optimized later.

For a truly high-bandwidth application, I can recommend Dataflow over Rx, because Rx is not designed for high concurrency. Raw TPL is the bottom layer, which guarantees the lowest overhead if you can handle the complexity. If you can make efficient use of dedicated threads, then that would be even faster. Choosing between async/await and Dataflow doesn't make any performance difference, IMO. The overhead seems comparable, so choose the one that's a better fit.

Robert Važan answered Nov 16 '22 23:11