How to read streaming dataset once and output to multiple sinks?

I have a Spark Structured Streaming job that reads from S3, transforms the data, and then stores it to one S3 sink and one Elasticsearch sink.

Currently, I am doing readStream once and then writeStream.format("").start() twice. When doing so, it seems that Spark reads the data from the S3 source twice, once per sink.
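
For reference, a minimal PySpark sketch of the setup described above; the paths, schema, and Elasticsearch resource are placeholders (not from the question), and the Elasticsearch sink assumes the elasticsearch-hadoop connector is on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import upper, col
from pyspark.sql.types import StructType, StringType

spark = SparkSession.builder.appName("multi-sink").getOrCreate()

# One source (streaming file sources need an explicit schema; this one is made up).
schema = StructType().add("id", StringType()).add("payload", StringType())
source = (spark.readStream
          .format("json")
          .schema(schema)
          .load("s3a://input-bucket/events/"))  # placeholder path

transformed = source.withColumn("payload", upper(col("payload")))

# ...but two start() calls, i.e. two independent streaming queries,
# each of which reads the S3 source on its own.
to_s3 = (transformed.writeStream
         .format("parquet")
         .option("path", "s3a://output-bucket/events/")
         .option("checkpointLocation", "s3a://output-bucket/_chk/s3/")
         .start())

to_es = (transformed.writeStream
         .format("es")  # elasticsearch-hadoop connector, assumed available
         .option("checkpointLocation", "s3a://output-bucket/_chk/es/")
         .start("events/event"))  # placeholder index/type
```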

Is there a more efficient way to write to multiple sinks in the same pipeline?

asked Sep 19 '17 by s11230
People also ask

Which method is used to count the streaming words and aggregate the previous data?

Complete output mode. This mode is used only when you have streaming aggregated data. One example is counting the words in streaming data, aggregating them with the previous data, and outputting the results to the sink.
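
A minimal sketch of that word count, assuming `lines` is a streaming DataFrame of text (for instance the socket source sketched under the next question):

```python
from pyspark.sql.functions import explode, split

# `lines` is assumed to be a streaming DataFrame with a string column
# named `value`, e.g. from the socket source shown below.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Complete mode re-emits the whole updated result table on every trigger,
# so the counts keep aggregating with data from previous micro-batches.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
```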

How does Pyspark read stream data?

Use readStream.format("socket") on the Spark session object to read data from a socket, and provide the host and port options for the source you want to stream from.
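
A minimal sketch of that socket source in PySpark; the host and port are placeholders for wherever your test stream runs (e.g. `nc -lk 9999`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("socket-read").getOrCreate()

# Read a stream of text lines from a TCP socket; each row has a single
# string column named `value`. Host and port are placeholder values.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())
```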

In which of the following modes output can be configured for structured streaming programming model?

The output can be defined in different modes: Complete Mode - the entire updated result table is written to the external storage; Append Mode - only the new rows appended to the result table since the last trigger are written; Update Mode - only the rows that were updated since the last trigger are written.
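
For illustration, a sketch of how each mode is selected on the writer, assuming `df` is some streaming DataFrame; note that not every mode is valid for every query (complete mode, for example, requires an aggregation):

```python
# Output mode is chosen per query on the DataStreamWriter.
df.writeStream.outputMode("complete").format("console").start()  # whole result table
df.writeStream.outputMode("append").format("console").start()    # new rows only
df.writeStream.outputMode("update").format("console").start()    # changed rows only
```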


1 Answer

Currently, I am doing readStream once and then writeStream.format("").start() twice.

You actually create two separate streaming queries. The load part merely describes the first (and only) streaming source; it does nothing execution-wise.
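
A quick way to see this from the shell (a sketch using the built-in rate test source): load only describes the source, and each start() produces its own StreamingQuery with its own id:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("two-queries").getOrCreate()

# Describing the source executes nothing yet.
source = spark.readStream.format("rate").load()

# Each start() launches a separate streaming query.
q1 = source.writeStream.queryName("sink-1").format("console").start()
q2 = source.writeStream.queryName("sink-2").format("console").start()

print(q1.id == q2.id)             # False: two distinct queries
print(len(spark.streams.active))  # 2: one query per sink
```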

When doing so, it seems that Spark reads the data from the S3 source twice, once per sink.

That is exactly how Spark Structured Streaming's queries work. The number of sinks corresponds to the number of queries, because one streaming query can have exactly one streaming sink (see StreamExecution, which sits behind every streaming query).

You can also check the number of threads (using jconsole or a similar tool), as Structured Streaming uses one microBatchThread per streaming query (see StreamExecution).

Is there a more efficient way to write to multiple sinks in the same pipeline?

It is not possible in the current design of Spark Structured Streaming.

answered Sep 21 '22 by Jacek Laskowski