
What is the difference between mini-batch vs real time streaming in practice (not theory)?

In theory, I understand that mini-batch processing collects records over a given time frame and processes them together, whereas real-time streaming acts on each record as it arrives. My main question is: why not run mini-batches with a very small (epsilon) time frame, say one millisecond? Put differently, what makes one approach a more effective solution than the other?

I recently came across an example where mini-batch processing (Apache Spark) is used for fraud detection and real-time streaming (Apache Flink) is used for fraud prevention. Someone also commented that mini-batches would not be an effective solution for fraud prevention, since the goal is to stop the transaction as it happens. Now I wonder why mini-batches (Spark) would not be effective here. Why is it not effective to run mini-batches with 1 millisecond latency? Batching is used everywhere, including the OS and the kernel TCP/IP stack, where data headed for the disk or the network is buffered. So what is the convincing factor for saying one is more effective than the other?

user1870400 asked Sep 27 '16 04:09


1 Answer

Disclaimer: I'm a committer and PMC member of Apache Flink. I'm familiar with the overall design of Spark Streaming but do not know its internals in detail.

The mini-batch stream processing model as implemented by Spark Streaming works as follows:

  • Records of a stream are collected in a buffer (mini-batch).
  • Periodically, the collected records are processed using a regular Spark job. This means that for each mini-batch, a complete distributed batch processing job is scheduled and executed.
  • While the job runs, the records for the next batch are collected.
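The three steps above can be sketched in a few lines of plain Python. This is only a toy model, not Spark code: the `(arrival_time, payload)` record format and the names `split_into_mini_batches` and `run_jobs` are assumptions made up for illustration. Unlike the real system, the sketch also does not overlap collection with processing, and it skips intervals in which no record arrived.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical record format: (arrival time in seconds, payload).
Record = Tuple[float, str]


def split_into_mini_batches(
    records: Iterable[Record], batch_interval: float
) -> List[List[str]]:
    """Step 1: collect records into one buffer per batch interval."""
    buffers: Dict[int, List[str]] = defaultdict(list)
    for arrival_time, payload in records:
        # A record lands in the buffer of the interval in which it arrived.
        buffers[int(arrival_time // batch_interval)].append(payload)
    # Only non-empty intervals appear here (a simplification of this sketch).
    return [buffers[index] for index in sorted(buffers)]


def run_jobs(batches: List[List[str]], job: Callable[[List[str]], int]) -> List[int]:
    """Step 2: run one complete job per mini-batch, in arrival order."""
    return [job(batch) for batch in batches]


records = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.5, "d")]
batches = split_into_mini_batches(records, batch_interval=1.0)
counts = run_jobs(batches, job=len)  # the "job" here simply counts records
```

The point to notice is that `job` is invoked once per interval, no matter how few records the buffer holds. Shrinking `batch_interval` does not shrink the per-job cost; it only increases how often that cost is paid.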

So why is it not effective to run a mini-batch every 1 ms? Simply because it would mean scheduling a distributed batch job every millisecond. Even though Spark schedules jobs very quickly, that would be too much: the fixed per-job scheduling overhead would dominate the actual work and significantly reduce the achievable throughput. Batching techniques used in operating systems or TCP likewise do not work well when their batches become too small.
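A back-of-the-envelope model makes the throughput argument concrete. It assumes, purely for illustration, that every job pays a fixed scheduling overhead and then a constant cost per record; the 10 ms overhead and 1 µs per-record cost below are hypothetical numbers, not measurements of Spark. For the system to keep up, a job must finish within one batch interval, so only the part of the interval left after the overhead is available for useful work.

```python
def max_sustainable_rate(
    batch_interval_s: float,
    job_overhead_s: float,
    per_record_cost_s: float,
) -> float:
    """Records per second the toy model can sustain.

    A job must finish within one interval:
        job_overhead_s + per_record_cost_s * rate * batch_interval_s
            <= batch_interval_s
    Solving for rate gives the expression returned below.
    """
    usable_s = batch_interval_s - job_overhead_s
    if usable_s <= 0:
        # The overhead alone already exceeds the interval: jobs pile up.
        return 0.0
    return usable_s / (batch_interval_s * per_record_cost_s)


# Hypothetical costs, chosen only to show the trend.
OVERHEAD_S = 0.010         # assumed fixed scheduling cost per job
PER_RECORD_S = 0.000001    # assumed processing cost per record

rate_1s = max_sustainable_rate(1.0, OVERHEAD_S, PER_RECORD_S)
rate_20ms = max_sustainable_rate(0.020, OVERHEAD_S, PER_RECORD_S)
rate_1ms = max_sustainable_rate(0.001, OVERHEAD_S, PER_RECORD_S)
```

Under these assumed costs, a 1 s interval loses about 1% of its capacity to scheduling, a 20 ms interval loses half, and a 1 ms interval cannot keep up at all. The real figures depend on the cluster and the job, but the shape of the curve is the same.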

Fabian Hueske answered Sep 21 '22 09:09
