 

Spark streaming multiple sources, reload dataframe

I have a Spark Streaming context reading event data from Kafka at 10-second intervals. I would like to complement this event data with the existing data in a Postgres table.

I can load the Postgres table with something like:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Load the Postgres table via JDBC into a DataFrame
val data = sqlContext.load("jdbc", Map(
  "url" -> url,
  "dbtable" -> query))

...

val broadcasted = sc.broadcast(data.collect())

And later I can join it with the stream like this:

// Rebuild an RDD from the broadcast value and join it against each batch
val db = sc.parallelize(broadcasted.value)
val dataset = stream_data.transform { rdd => rdd.leftOuterJoin(db) }

I would like to keep my current DStream running and still reload this table every 6 hours. Since Apache Spark doesn't currently support multiple running contexts, how can I accomplish this? Is there any workaround, or will I need to restart the server each time I want to reload the data? This seems like such a simple use case... :/

asked May 13 '15 by user838681


1 Answer

In my humble opinion, reloading another data source inside the transformations on DStreams is not recommended, by design.

Compared to traditional stateful stream-processing models, D-Streams are designed to structure a streaming computation as a series of stateless, deterministic batch computations on small time intervals.

The transformations on DStreams are deterministic, and this design enables quick recovery from faults by recomputation. Refreshing the reference data inside a transformation would introduce side effects into that recovery/recomputation.

One workaround is to postpone the query to the output operations, for example foreachRDD(func).
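To make that concrete, here is a minimal sketch of the idea, assuming the same sc, sqlContext, url, query and stream_data (a DStream of key/value pairs) as in the question; refreshIntervalMs, lastLoaded and referenceRdd are illustrative names, and the column positions used to build the pairs are placeholders:

import org.apache.spark.rdd.RDD

val refreshIntervalMs = 6 * 60 * 60 * 1000L          // reload the table every 6 hours
var lastLoaded = 0L
var referenceRdd: RDD[(String, String)] = sc.emptyRDD[(String, String)]

stream_data.foreachRDD { rdd =>
  // The body of foreachRDD runs on the driver once per batch interval,
  // so it is a place where Postgres can be re-queried when the cached copy is stale.
  val now = System.currentTimeMillis()
  if (now - lastLoaded > refreshIntervalMs) {
    val table = sqlContext.load("jdbc", Map("url" -> url, "dbtable" -> query))
    // Reshape the rows into key/value pairs matching the stream's keys (placeholder columns).
    referenceRdd = table.rdd.map(row => (row.getString(0), row.getString(1))).cache()
    lastLoaded = now
  }
  val joined = rdd.leftOuterJoin(referenceRdd)
  // ... write `joined` to its sink here
}

Keeping the refresh in the output operation follows the suggestion above: the DStream transformations themselves stay free of side effects, and only the driver-side output code touches the external database.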

answered by Shawn Guo