 

Multiple windows of different durations in Spark Streaming application

I would like to process a real-time stream of data (from Kafka) using Spark Streaming. I need to compute various stats from the incoming stream, and they need to be computed for windows of varying durations. For example, I might need to compute the average value of stat 'A' over the last 5 minutes while simultaneously computing the median of stat 'B' over the last 1 hour.

In this case, what's the recommended approach to using Spark Streaming? Below are a few options I could think of:

(i) Have a single DStream from Kafka and create multiple DStreams from it using the window() method. For each of these resulting DStreams, the windowDuration would be set to a different value as required. e.g.:

// pseudo-code
// window(windowDuration, slideDuration)
val streamA = kafkaDStream.window(Minutes(5), Minutes(1))   // last 5 min, recomputed every minute
val streamB = kafkaDStream.window(Hours(1), Minutes(10))    // last hour, recomputed every 10 min

(ii) Run separate Spark Streaming apps - one for each stat

Questions

To me (i) seems like a more efficient approach. However, I have a couple of doubts regarding that:

  • How would streamA and streamB be represented in the underlying data structure?
  • Would they share data, since they originate from the same Kafka DStream? Or would the data be duplicated?
  • Also, are there more efficient ways to handle such a use case?

Thanks in advance

jithinpt asked Jul 21 '15

1 Answer

Your (i) streams look sensible, will share data, and you can look at WindowedDStream to get an idea of the underlying representation. Note your streams are of course lazy, so only the batches being computed upon are in the system at any given time.

Since the state you have to maintain for the computation of an average is small (2 numbers), you should be fine. I'm more worried about the median (which requires a pair of heaps).
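The two-heap bookkeeping behind a running median can be sketched in plain Scala. This is a hedged, self-contained sketch (the class name `RunningMedian` is illustrative, not a Spark API): the lower half of the values lives in a max-heap and the upper half in a min-heap, so the median is always at one or both heap tops. Note it handles a growing stream; evicting old values as a window slides requires additional work.

```scala
import scala.collection.mutable

// Illustrative sketch: streaming median via a pair of heaps.
// `lower` is a max-heap holding the smaller half of the values,
// `upper` is a min-heap holding the larger half.
class RunningMedian {
  private val lower = mutable.PriorityQueue.empty[Double]
  private val upper = mutable.PriorityQueue.empty[Double](Ordering[Double].reverse)

  def add(x: Double): Unit = {
    if (lower.isEmpty || x <= lower.head) lower.enqueue(x) else upper.enqueue(x)
    // Rebalance so the heap sizes differ by at most one
    if (lower.size > upper.size + 1) upper.enqueue(lower.dequeue())
    else if (upper.size > lower.size) lower.enqueue(upper.dequeue())
  }

  def median: Double =
    if (lower.size > upper.size) lower.head           // odd count: middle element
    else (lower.head + upper.head) / 2.0              // even count: mean of the two middles
}
```

Each insert is O(log n), versus O(1) and constant state for a running average, which is why the median is the heavier of the two stats.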

One thing you haven't made clear, though, is whether you really need the update component of your aggregation that is implied by the windowing operation. Your streamA maintains the last 5 minutes of data, updated every minute, and streamB maintains the last hour, updated every 10 minutes.

If you don't need that freshness, dropping the requirement will of course minimize the amount of data in the system: you can have a streamA with a batch interval of 5 minutes and a streamB derived from it (with window(Hours(1)), since 60 is a multiple of 5).
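That alternative can be sketched in the same pseudo-code style as the question (the StreamingContext setup and the creation of kafkaDStream are assumed; note that when the slide duration is omitted, window() slides once per batch interval, so streamB here would actually refresh every 5 minutes rather than every 10):

```scala
// pseudo-code
// ssc is assumed to be a StreamingContext with a 5-minute batch interval,
// and kafkaDStream is assumed to be created from it.
val streamA = kafkaDStream              // each batch already covers the last 5 minutes
val streamB = streamA.window(Hours(1))  // 12 batches = 1 hour, slides every batch (5 min)
```

This keeps only one window buffer in the system instead of two independent ones over the raw Kafka stream.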

Francois G answered Oct 19 '22