 

Spark Streaming: How to periodically refresh cached RDD?

In my Spark Streaming application, I want to map values based on a dictionary that is retrieved from a backend (ElasticSearch). I want to refresh the dictionary periodically, in case it gets updated in the backend, similar to the periodic refresh capability of Logstash's translate filter. How could I achieve this with Spark (e.g. somehow unpersist the RDD every 30 seconds)?

asked Jun 05 '16 by lairtech

People also ask

Is caching relevant in Spark Streaming?

Caching and persistence are optimization techniques for interactive and iterative Spark computations. They save intermediate results so we can reuse them in subsequent stages. These intermediate RDDs are kept in memory (the default) or on more durable storage such as disk, and can also be replicated.
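As an illustration only (a minimal sketch, assuming a SparkContext named sc and a hypothetical input path), caching an intermediate RDD lets several actions reuse it instead of recomputing the whole lineage:

// hypothetical example: cache an intermediate result so later actions reuse it
val lines  = sc.textFile("hdfs:///logs/app.log")        // assumed input path
val errors = lines.filter(_.contains("ERROR")).cache()  // kept in memory after first use
val total  = errors.count()                             // first action: computes and caches
val byHost = errors.map(l => (l.split(" ")(0), 1L))     // second action reuses the cache
                   .reduceByKey(_ + _)
                   .collect()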

What is a batch interval in Spark Streaming?

A batch interval tells Spark for how long to collect data before processing it: if it is 1 minute, each batch contains the data of the last 1 minute (source: spark.apache.org). The data then arrives as a continuous stream of such batches, which is called a DStream.
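For example, a sketch (assuming an existing SparkConf named conf and a text source on localhost:9999) of a StreamingContext with a one-minute batch interval:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// every 60 seconds Spark Streaming cuts a new batch from the incoming stream
val ssc   = new StreamingContext(conf, Seconds(60))
val lines = ssc.socketTextStream("localhost", 9999)  // assumed source
lines.count().print()                                // one count per 60-second batch
ssc.start()
ssc.awaitTermination()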

What is a sliding interval in Spark Streaming?

The sliding interval is the amount of time, in seconds, by which the window shifts. In the previous example the sliding interval is 1, since the calculation is kicked off each second, i.e. at time=1, time=2, time=3... If you set the sliding interval to 2, you get a calculation at time=1, time=3, time=5...
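A sketch of such a windowed computation, assuming a DStream named lines built on a 1-second batch interval; a 3-second window sliding every 2 seconds emits a new result every 2 seconds, each covering the last 3 seconds of data:

// group the last 3 seconds of data, recomputed every 2 seconds
val windowed = lines.window(Seconds(3), Seconds(2))
windowed.count().print()   // one count every 2 seconds, over a 3-second window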

What is caching in Spark Streaming?

Both caching and persisting are used to save Spark RDDs, DataFrames, and Datasets. The difference is that the RDD cache() method saves to memory by default (MEMORY_ONLY), whereas persist() stores the data at a user-defined storage level.
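A minimal sketch, assuming two existing RDDs named rddA and rddB:

import org.apache.spark.storage.StorageLevel

rddA.cache()                                // shorthand for persist(StorageLevel.MEMORY_ONLY)
rddB.persist(StorageLevel.MEMORY_AND_DISK)  // explicit, user-chosen storage level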


1 Answer

The best way I've found to do that is to recreate the RDD and maintain a mutable reference to it. Spark Streaming is, at its core, a scheduling framework on top of Spark, so we can piggy-back on the scheduler to have the RDD refreshed periodically. For that, we use an empty DStream that we schedule only for the refresh operation:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.ConstantInputDStream

// ssc: StreamingContext, sparkContext: SparkContext, refreshInterval: seconds (defined elsewhere)

// function that (re)creates the RDD we want to use as reference data
def getData(): RDD[Data] = ???

val dstream = ??? // our data stream

// a DStream of empty data, windowed so it fires once per refresh interval
val refreshDstream = new ConstantInputDStream(ssc, sparkContext.parallelize(Seq()))
  .window(Seconds(refreshInterval), Seconds(refreshInterval))

var referenceData = getData()
referenceData.cache()

refreshDstream.foreachRDD { _ =>
  // evict the old RDD from memory and recreate it
  referenceData.unpersist(true)
  referenceData = getData()
  referenceData.cache()
}

val myBusinessData = dstream.transform(rdd => rdd.join(referenceData))
... etc ...

In the past, I've also tried interleaving only cache() and unpersist(), with no result (it refreshes only once). Recreating the RDD removes all lineage and provides a clean load of the new data.
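For completeness, a hedged sketch of how this could be wired to the dictionary use case from the question; loadDictionaryFromBackend() and extractKey() are hypothetical stand-ins (e.g. for the ElasticSearch read), and both sides are keyed so that the join operates on pair RDDs:

// hypothetical helper: fetch the current dictionary from the backend
def loadDictionaryFromBackend(): Map[String, String] = ???

// getData() for this use case: the dictionary as a keyed RDD
def getData(): RDD[(String, String)] =
  sparkContext.parallelize(loadDictionaryFromBackend().toSeq)

// key the stream so it can be joined against the reference data
val keyedStream    = dstream.map(event => (extractKey(event), event))  // extractKey is hypothetical
val myBusinessData = keyedStream.transform(rdd => rdd.join(referenceData))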

answered Sep 29 '22 by maasg