Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Window in Spark Streaming?

In spark streaming, the DStreams we receive is a batch of RDDs. So how does windowing helps further.

As per my understanding it also batches the RDDs.

Correct me if I am wrong (new to Spark Streaming).

like image 267
dexter Avatar asked Oct 08 '15 08:10

dexter


People also ask

What is window function in Spark?

Window functions are useful for processing tasks such as calculating a moving average, computing a cumulative statistic, or accessing the value of rows given the relative position of the current row.

What is a window duration size in Spark Streaming?

For example if you set batch interval 5 seconds - Spark Streaming will collect data for 5 seconds and then kick out calculation on RDD with that data. window size - it is interval of time in seconds for how much historical data shall be contained in RDD before processing.

What is sliding window Spark?

Sliding Window controls transmission of data packets between various computer networks. Spark Streaming library provides windowed computations where the transformations on RDDs are applied over a sliding window of data.


1 Answers

The number of records in one batch is determined by the batch interval. A window will keep the number of batches as fit within the size of a window, that's why the window size must be a multiple of the batch interval. Your operations will then run on multiple batches, and with each new batch the window will move forward, discarding older batches.

The point is that in streaming, data that belongs together doesn't necessarily arrive at the same time, especially at low batch intervals. With windows you are essentially looking back in time.

But note that your job still runs at the specified batch interval, so it will produce output at the same pace as before but look at more data at once. You will also look at the same data multiple times!

There a nice blog post by Michael Noll which explains this in more detail: http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/.

Update:

You can increase your batch interval, but then your job is processing slower as well, i.e. only creating output every 10 seconds instead of 2. You can also put a window on one part of the computation, whereas the batch interval affects everything. Check out reduceByKeyAndWindow for example.

like image 133
Marius Soutier Avatar answered Sep 20 '22 11:09

Marius Soutier