I have a Spark Streaming context reading event data from Kafka at 10 second intervals. I would like to complement this event data with the existing data in a Postgres table.
I can load the postgres table with something like:
val sqlContext = new SQLContext(sc)
val data = sqlContext.load("jdbc", Map(
  "url" -> url,
  "dbtable" -> query))
...
val broadcasted = sc.broadcast(data.collect())
And later I can join it with the stream like this:
val db = sc.parallelize(broadcasted.value)
val dataset = stream_data.transform{ rdd => rdd.leftOuterJoin(db)}
I would like to keep my current datastream running and still reload this table every 6 hours. Since Apache Spark doesn't currently support multiple running contexts, how can I accomplish this? Is there any workaround, or will I need to restart the server each time I want to reload the data? This seems like such a simple use case... :/
In my humble opinion, reloading another data source inside the transformations on DStreams is discouraged by design.
Compared to traditional stateful streaming processing models, D-Streams is designed to structure a streaming computation as a series of stateless, deterministic batch computations on small time intervals.
The transformations on DStreams are deterministic, and this design enables quick recovery from faults by recomputation. Refreshing the data inside a transformation would introduce side effects into that recovery/recomputation.
One workaround is to postpone the query to an output operation, for example foreachRDD(func), as sketched below.
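Here is a minimal sketch of that workaround, assuming your stream is a pair DStream keyed the same way as the Postgres rows. The RefreshableLookup holder, the loadTable() helper, and the Map[String, String] value type are hypothetical stand-ins for the JDBC load shown in the question. The idea is to keep the broadcast variable on the driver and re-create it from inside foreachRDD once it is older than 6 hours:

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object RefreshableLookup {
  private val ttlMs = 6L * 60 * 60 * 1000            // refresh interval: 6 hours
  private var lastLoaded = 0L
  private var cached: Broadcast[Map[String, String]] = _

  // Called from the driver on every micro-batch; reloads and re-broadcasts
  // the table only when the cached copy is older than ttlMs.
  def get(sc: SparkContext, loadTable: () => Map[String, String]): Broadcast[Map[String, String]] = synchronized {
    val now = System.currentTimeMillis()
    if (cached == null || now - lastLoaded > ttlMs) {
      if (cached != null) cached.unpersist()          // drop the stale copy from the executors
      cached = sc.broadcast(loadTable())              // re-run the JDBC query and re-broadcast
      lastLoaded = now
    }
    cached
  }
}

// foreachRDD runs this closure on the driver for every 10 second batch, so the
// staleness check above is evaluated each batch but only reloads every 6 hours.
stream_data.foreachRDD { rdd =>
  val lookup = RefreshableLookup.get(rdd.sparkContext, () => loadTable())
  val joined = rdd.map { case (k, v) => (k, (v, lookup.value.get(k))) }   // left-outer semantics: Option value
  // write `joined` out here
}

Note that the sketch does a map-side lookup against the broadcast map instead of parallelize + leftOuterJoin: since the table already fits in a broadcast variable, looking it up element by element avoids shuffling the stream on every batch.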