 

Spark streaming data sharing between batches

Spark streaming processes the data in micro batches.

Each interval's data is processed in parallel using RDDs, without any data sharing between intervals.

But my use case needs to share the data between intervals.

Consider the NetworkWordCount example, which produces the count of all words received in that interval.

How would I produce the following word count?

  • A relative count for the words "hadoop" and "spark", i.e. the current interval's count minus the previous interval's count

  • The normal word count for all other words.

Note: updateStateByKey does stateful processing, but it applies the update function to every key rather than only to particular keys.

So updateStateByKey doesn't fit this requirement.
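
For context, here is a minimal sketch of the usual updateStateByKey word count, roughly following the stateful example in the Spark Streaming programming guide (host, port, and checkpoint path are placeholders): the update function is invoked for every key that has new values or existing state, which is why it cannot be restricted to just hadoop and spark.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StatefulNetworkWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/spark-checkpoint")   // required by updateStateByKey

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))

// Called for every key with new values or existing state in each batch,
// accumulating a running total per word -- not only for selected keywords.
val updateTotal = (newValues: Seq[Int], state: Option[Int]) => {
  Some(newValues.sum + state.getOrElse(0))
}

val runningCounts = pairs.updateStateByKey[Int](updateTotal)
runningCounts.print()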

Update:

Consider the following example:

Interval-1

Input:

Sample Input with Hadoop and Spark on Hadoop

output:

hadoop  2
sample  1
input   1
with    1
and     1
spark   1
on      1

Interval-2

Input:

Another Sample Input with Hadoop and Spark on Hadoop and another hadoop another spark spark

output:

another 3
hadoop  1
spark   2
and     2
sample  1
input   1
with    1
on      1

Explanation:

The 1st interval gives the normal word count of all words.

In the 2nd interval, hadoop occurred 3 times, but the output should be 1 (3 - 2).

spark occurred 3 times, but the output should be 2 (3 - 1).

For all other words, it should give the normal word count.

So, while processing the 2nd interval's data, it needs the 1st interval's word counts for hadoop and spark.

This is a simple example for illustration.

In the actual use case, the fields that need data sharing are part of the RDD elements, and a huge number of values needs to be tracked.

That is, instead of just the two keywords hadoop and spark as in this example, nearly 100k keywords need to be tracked.

Similar use cases in Apache Storm:

Distributed caching in storm

Storm TransactionalWords

Vijay Innamuri asked May 05 '15


People also ask

What is batch interval in Spark streaming?

The batch interval tells Spark how long to collect data before forming a batch; for example, with a 1-minute interval, each batch contains the data received during the last minute (source: spark.apache.org). The data then arrives as a continuous series of these batches, and this continuous stream of data is called a DStream.
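
As a rough sketch (host, port, and the 60-second interval are placeholders), the batch interval is the second argument when constructing the StreamingContext:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Every 60 seconds the data received so far is cut into a new batch (an RDD).
val conf = new SparkConf().setAppName("BatchIntervalExample").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(60))

// The resulting DStream is the continuous sequence of those batches.
val lines = ssc.socketTextStream("localhost", 9999)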

What is DStream and what is the difference between batch and DStream in Spark streaming?

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.

What is a sliding interval in Spark streaming?

The sliding interval is the amount of time (in seconds) by which the window shifts. In the previous example the sliding interval is 1, since the calculation is kicked off each second, i.e. at time=1, time=2, time=3, and so on. If you set the sliding interval to 2, you get the calculation at time=1, time=3, time=5, and so on.
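
A small sketch of both parameters, assuming the pairs DStream of (word, 1) tuples from the word-count sketch earlier (window length and slide duration are illustrative values):

import org.apache.spark.streaming.Seconds

// Counts the words seen in the last 30 seconds, recomputed every 10 seconds.
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,  // reduce function
  Seconds(30),                // window length
  Seconds(10)                 // sliding interval
)
windowedCounts.print()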

Is Spark streaming micro batch?

Micro-batch loading technologies include Fluentd, Logstash, and Apache Spark Streaming. Micro-batch processing is very similar to traditional batch processing in that data are usually processed as a group. The primary difference is that the batches are smaller and processed more often.


1 Answer

This is possible by "remembering" the last RDD received and using a left join to merge that data with the next streaming batch. We make use of streamingContext.remember to enable RDDs produced by the streaming process to be kept for the time we need them.

We make use of the fact that dstream.transform is an operation that executes on the driver and therefore we have access to all local object definitions. In particular we want to update the mutable reference to the last RDD with the required value on each batch.

A piece of code probably makes the idea clearer:

// imports needed by this snippet
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Seconds

// configure the streaming context to remember the RDDs produced
// choose at least 2x the time of the streaming interval
ssc.remember(Seconds(xx))

// Initialize "currentData" with an empty RDD of the expected type
var currentData: RDD[(String, Int)] = sparkContext.emptyRDD

// classic word count
val w1dstream = dstream.map(elem => (elem,1))    
val count = w1dstream.reduceByKey(_ + _)    

// Here's the key to making this work: note how we update the value of the last RDD after using it.
val diffCount = count.transform { rdd =>
  val interestingKeys = Set("hadoop", "spark")
  // Keep only the keys whose counts we need for the next batch.
  val interesting = rdd.filter { case (k, v) => interestingKeys(k) }
  // Left join with the previous batch: tracked keys get a relative count, all other keys keep their normal count.
  val countDiff = rdd.leftOuterJoin(currentData).map { case (k, (v1, v2)) => (k, v1 - v2.getOrElse(0)) }
  currentData = interesting
  countDiff
}

diffCount.print()
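
As a usage note (standard Spark Streaming boilerplate rather than part of the answer's snippet), the context still has to be started for any of the above to run:

ssc.start()             // begin receiving data and processing batches
ssc.awaitTermination()  // block until the streaming job is stopped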
maasg answered Oct 12 '22