
Difference between batch interval, sliding interval and window size in spark streaming

I am new to Spark Streaming. I understand that the window size needs to be a multiple of the batch interval. But how does the sliding interval work? If I have 3 as the window size and 2 as the sliding interval, wouldn't there be an overlap when I calculate, say, word counts? Or should the sliding interval and the batch interval be the same?

Hariprasath Thiagarajan asked Jun 04 '18 06:06





1 Answer

Here is a link to the documentation.

[Image: window size and sliding interval shown relative to the batch interval]

Let's walk through these concepts:

  1. batch interval - the length of time (for example, in seconds) during which data is collected before processing is dispatched on it. For example, if you set the batch interval to 5 seconds, Spark Streaming will collect data for 5 seconds and then kick off a calculation on the RDD containing that data.
  2. window size - the length of time (in seconds) of historical data that is included in the RDD before processing. For example, with a 1 second batch interval and a window size of 2 seconds, a calculation is kicked off every second over the 2 most recent batches. E.g. at time=3 you will have data from the batches at time=2 and time=3.
  3. sliding interval - the amount of time (in seconds) by which the window shifts between calculations. In the previous example the sliding interval is 1 (since a calculation is kicked off every second), e.g. at time=1, time=2, time=3... If you set sliding interval=2, you will get a calculation at time=1, time=3, time=5...
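The three definitions above can be tried out without a Spark cluster. The following is a minimal pure-Python sketch, not Spark code: `batches_in_window` and `firing_times` are hypothetical helper names, batches are labelled by the time at which they complete, and the first firing time is passed in explicitly because the exact times at which a real stream fires depend on when it was started.

```python
def batches_in_window(fire_time, window_size, batch_interval=1):
    """Return the batch times covered by the window that closes at fire_time.

    Assumes window_size is a multiple of batch_interval and that the
    first batch completes at time=batch_interval.
    """
    n_batches = window_size // batch_interval
    times = [fire_time - i * batch_interval for i in range(n_batches)]
    # Drop times before the stream started (no batch exists there).
    return sorted(t for t in times if t > 0)


def firing_times(first, slide_interval, count):
    """Return the first `count` times at which a windowed calculation runs."""
    return [first + i * slide_interval for i in range(count)]


# Window size 2, batch interval 1: at time=3 the window holds batches 2 and 3.
print(batches_in_window(3, window_size=2))

# Sliding interval 2, starting at time=1: calculations at 1, 3, 5.
for t in firing_times(first=1, slide_interval=2, count=3):
    print(t, batches_in_window(t, window_size=3))
```

With a window size of 3 and a sliding interval of 2, the loop shows the batch at time=3 appearing in both the window that closes at time=3 and the one that closes at time=5.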

You can refer to the image above, where the window size is 3 times the batch interval and the sliding interval is 2 times the batch interval.

To answer the question of why the window size and sliding interval must be multiples of the batch interval - otherwise your window would end in the middle of a batch.
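The multiple-of-batch-interval rule can be expressed as a simple divisibility check. This is an illustrative sketch only; `validate_durations` is a hypothetical helper, not part of the Spark API, and all durations are assumed to be whole seconds.

```python
def validate_durations(batch_interval, window_size, slide_interval):
    """Return a list of problems; an empty list means the settings line up.

    A window or slide that is not a whole number of batches would have
    to start or end partway through a batch.
    """
    problems = []
    if window_size % batch_interval != 0:
        problems.append("window size is not a multiple of the batch interval")
    if slide_interval % batch_interval != 0:
        problems.append("sliding interval is not a multiple of the batch interval")
    return problems


print(validate_durations(batch_interval=1, window_size=3, slide_interval=2))
print(validate_durations(batch_interval=2, window_size=3, slide_interval=2))
```

The first call reports no problems, while the second flags the window size, since a 3 second window cannot be built from whole 2 second batches.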

If you have 3 as the window size and 2 as the sliding interval (see image) - yes, your word counts will overlap. Basically, you use a window when you want to calculate something over a limited period of time - like recent news or tweets or whatever - when you don't need all historical data for the analysis.
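To make the overlap concrete, here is a small pure-Python sketch of a windowed word count with window size 3 and sliding interval 2. The batch contents are made-up sample data and `windowed_counts` is a hypothetical helper; it only imitates what a windowed count does and does not use Spark.

```python
from collections import Counter

# Made-up sample data: words received in each 1 second batch.
batches = {
    1: ["spark", "kafka"],
    2: ["spark"],
    3: ["kafka", "flink"],
    4: ["flink"],
    5: ["spark"],
}


def windowed_counts(batches, fire_time, window_size):
    """Count words over the window_size batches ending at fire_time."""
    counts = Counter()
    for t in range(fire_time - window_size + 1, fire_time + 1):
        counts.update(batches.get(t, []))
    return counts


at_3 = windowed_counts(batches, fire_time=3, window_size=3)  # batches 1, 2, 3
at_5 = windowed_counts(batches, fire_time=5, window_size=3)  # batches 3, 4, 5

print(at_3)
print(at_5)
```

The words from the batch at time=3 are counted in both results, so adding the two windows together would count them twice. Each window is meant to be read on its own as "the counts for the last 3 seconds".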

Vladislav Varslavans answered Sep 22 '22 13:09