I am new to Spark Streaming. I understand that the window size needs to be a multiple of the batch interval. But how does the sliding interval work? If I have 3 as the window size and 2 as the sliding interval, wouldn't there be an overlap when I calculate, say, word counts? Or should the sliding interval and the batch interval be the same?
Here is a link to the documentation.
Let's walk through these concepts:
You can refer to the image above, where the window size is 3 times the batch interval and the sliding interval is 2 times the batch interval.
As for why the window and sliding intervals must be multiples of the batch interval: otherwise your window would end in the middle of a batch.
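As a rough illustration of that rule, here is a minimal Python sketch. The function name `is_valid_window` and the use of whole seconds are assumptions made for this example; this is not Spark's own validation code.

```python
def is_valid_window(batch_s: int, window_s: int, slide_s: int) -> bool:
    """Return True when both the window size and the sliding interval
    are whole multiples of the batch interval (all values in seconds)."""
    return window_s % batch_s == 0 and slide_s % batch_s == 0


# Batch interval of 10 s: a 30 s window sliding every 20 s lines up with batches.
print(is_valid_window(10, 30, 20))
# A 25 s window would end halfway through a batch.
print(is_valid_window(10, 25, 20))
```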
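If you have a window size of 3 and a sliding interval of 2 (see image), then yes, your word counts will overlap: the batch shared by two consecutive windows is counted in both. To avoid overlap, make the sliding interval equal to the window size. Basically, you use a window when you want to calculate something over a limited, recent period of time, such as current news or tweets, and you don't need all the historical data for the analysis.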
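To make the overlap concrete, here is a small plain-Python simulation, not Spark code. The helper `windowed_counts`, the sample batches, and the choice to emit the first window once three batches have arrived are all assumptions of this sketch; Spark's exact window alignment may differ.

```python
from collections import Counter


def windowed_counts(batches, window, slide):
    """Count words per window.

    batches: list of word lists, one per batch interval.
    window, slide: sizes expressed in number of batches.
    Returns a list of (last_batch_number, counts) pairs.
    """
    results = []
    for end in range(window, len(batches) + 1, slide):
        counts = Counter()
        # The window covers the `window` most recent batches up to `end`.
        for batch in batches[end - window:end]:
            counts.update(batch)
        results.append((end, dict(counts)))
    return results


# Five batches of words; batch 3 is ["a", "c"].
batches = [["a"], ["b"], ["a", "c"], ["b"], ["c"]]
results = windowed_counts(batches, window=3, slide=2)
for end, counts in results:
    print(end, counts)
```

With these inputs, the first window covers batches 1 to 3 and the second covers batches 3 to 5, so the words in batch 3 contribute to both results. In Spark Streaming itself, the equivalent computation would typically use a windowed operation such as `reduceByKeyAndWindow` with the window and slide durations.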