I have a Spark Streaming context reading event data from Kafka at 10 second intervals. I would like to complement this event data with the existing data in a Postgres table.
I can load the postgres table with something like:
val sqlContext = new SQLContext(sc)
val data = sqlContext.load("jdbc", Map(
  "url" -> url,
  "dbtable" -> query))
...
val broadcasted = sc.broadcast(data.collect())
And later I can join it with the stream like this:
val db = sc.parallelize(broadcasted.value)
val dataset = stream_data.transform{ rdd => rdd.leftOuterJoin(db)}
I would like to keep my current datastream running and still reload this table every 6 hours. Since Apache Spark doesn't currently support multiple running contexts, how can I accomplish this? Is there any workaround, or will I need to restart the server each time I want to reload the data? This seems like such a simple use case... :/
In my humble opinion, reloading another data source inside the transformations on DStreams is discouraged by design.
Compared to traditional stateful streaming processing models, D-Streams is designed to structure a streaming computation as a series of stateless, deterministic batch computations on small time intervals.
The transformations on DStreams are deterministic, and this design enables quick recovery from faults by recomputation. Refreshing the data inside a transformation would introduce side effects into that recovery/recomputation.
One workaround is to postpone the query to an output operation, for example foreachRDD(func), as sketched below.
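Here is a minimal sketch of that workaround, assuming your stream is a pair DStream keyed the same way as the Postgres rows. The RefreshableLookup holder, the loadTable() helper, and the Map[String, String] value type are hypothetical stand-ins for the JDBC load shown in the question. The idea is to keep the broadcast variable on the driver and re-create it from inside foreachRDD once it is older than 6 hours:

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object RefreshableLookup {
  private val ttlMs = 6L * 60 * 60 * 1000            // refresh interval: 6 hours
  private var lastLoaded = 0L
  private var cached: Broadcast[Map[String, String]] = _

  // Called from the driver on every micro-batch; reloads and re-broadcasts
  // the table only when the cached copy is older than ttlMs.
  def get(sc: SparkContext, loadTable: () => Map[String, String]): Broadcast[Map[String, String]] = synchronized {
    val now = System.currentTimeMillis()
    if (cached == null || now - lastLoaded > ttlMs) {
      if (cached != null) cached.unpersist()          // drop the stale copy from the executors
      cached = sc.broadcast(loadTable())              // re-run the JDBC query and re-broadcast
      lastLoaded = now
    }
    cached
  }
}

// foreachRDD runs this closure on the driver for every 10 second batch, so the
// staleness check above is evaluated each batch but only reloads every 6 hours.
stream_data.foreachRDD { rdd =>
  val lookup = RefreshableLookup.get(rdd.sparkContext, () => loadTable())
  val joined = rdd.map { case (k, v) => (k, (v, lookup.value.get(k))) }   // left-outer semantics: Option value
  // write `joined` out here
}

Note that the sketch does a map-side lookup against the broadcast map instead of parallelize + leftOuterJoin: since the table already fits in a broadcast variable, looking it up element by element avoids shuffling the stream on every batch.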