I have a Spark Structured Streaming job that reads from S3, transforms the data, and then stores it to one S3 sink and one Elasticsearch sink.
Currently, I am doing readStream once and then writeStream.format("").start() twice. When doing so, it seems that Spark reads the data twice from the S3 source, once per sink.
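For reference, a minimal sketch of this setup (the source format, schema, S3 paths, and sink options are hypothetical; the Elasticsearch sink assumes the elasticsearch-hadoop connector is on the classpath):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("multi-sink-demo")
  .getOrCreate()

// One readStream describing the streaming source.
val input = spark.readStream
  .format("json")                          // hypothetical source format
  .schema("id INT, value STRING")          // streaming file sources need an explicit schema
  .load("s3a://bucket/input/")             // hypothetical S3 path

val transformed = input.filter("value IS NOT NULL")

// First start(): streaming query #1, writing to S3.
val s3Query = transformed.writeStream
  .format("parquet")
  .option("path", "s3a://bucket/output/")
  .option("checkpointLocation", "s3a://bucket/checkpoints/s3/")
  .start()

// Second start(): streaming query #2, writing to Elasticsearch,
// which reads the source again on its own.
val esQuery = transformed.writeStream
  .format("es")
  .option("checkpointLocation", "s3a://bucket/checkpoints/es/")
  .start("index/type")                     // hypothetical ES index

spark.streams.awaitAnyTermination()
```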
Is there a more efficient way to write to multiple sinks in the same pipeline?
Currently, I am doing readStream once and then writeStream.format("").start() twice.
You actually create two separate streaming queries. The load part merely describes the first (and only) streaming source; execution-wise it does nothing.
When doing so it seems that Spark reads the data twice from the S3 source, once per sink.
That is exactly how Spark Structured Streaming's queries work: the number of sinks corresponds to the number of queries, because one streaming query can have exactly one streaming sink (see StreamExecution, which sits behind any streaming query).
You can also check the number of threads (using jconsole or similar), as Structured Streaming uses one microBatchThread thread per streaming query (see StreamExecution).
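You can also see the two queries directly through the StreamingQueryManager (continuing the sketch above, with spark and both started queries in scope):

```scala
// Each call to start() registered an independent StreamingQuery with
// the session's StreamingQueryManager, so two entries show up here.
spark.streams.active.foreach { q =>
  println(s"id=${q.id} name=${q.name} isActive=${q.isActive}")
}
```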
Is there a more efficient way to write to multiple sinks in the same pipeline?
It is not possible in the current design of Spark Structured Streaming.
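That said, Spark 2.4 and later (newer than the design this answer describes) added DataStreamWriter.foreachBatch, which lets a single streaming query, and hence a single read of the source, write each micro-batch to multiple sinks using the batch writer API. A sketch, reusing the hypothetical transformed Dataset from above:

```scala
import org.apache.spark.sql.DataFrame

// One streaming query: the source is read once per micro-batch, and the
// batch is fanned out to both sinks via the batch DataFrameWriter API.
val query = transformed.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.persist()  // cache so the batch is not recomputed per sink
    batch.write.mode("append").parquet("s3a://bucket/output/")   // hypothetical path
    batch.write.mode("append").format("es").save("index/type")   // hypothetical index
    batch.unpersist()
  }
  .option("checkpointLocation", "s3a://bucket/checkpoints/multi/")
  .start()
```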