I have a Spark Structured Streaming job that reads from S3, transforms the data, and then stores it to one S3 sink and one Elasticsearch sink.
Currently, I am doing readStream once and then writeStream.format("").start() twice. When doing so, it seems that Spark reads the data twice from the S3 source, once per sink.
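For reference, a minimal sketch of this setup (the source format, schema, S3 paths, and sink options are hypothetical; the Elasticsearch sink assumes the elasticsearch-hadoop connector is on the classpath):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("multi-sink-demo")
  .getOrCreate()

// One readStream describing the streaming source.
val input = spark.readStream
  .format("json")                          // hypothetical source format
  .schema("id INT, value STRING")          // streaming file sources need an explicit schema
  .load("s3a://bucket/input/")             // hypothetical S3 path

val transformed = input.filter("value IS NOT NULL")

// First start(): streaming query #1, writing to S3.
val s3Query = transformed.writeStream
  .format("parquet")
  .option("path", "s3a://bucket/output/")
  .option("checkpointLocation", "s3a://bucket/checkpoints/s3/")
  .start()

// Second start(): streaming query #2, writing to Elasticsearch,
// which reads the source again on its own.
val esQuery = transformed.writeStream
  .format("es")
  .option("checkpointLocation", "s3a://bucket/checkpoints/es/")
  .start("index/type")                     // hypothetical ES index

spark.streams.awaitAnyTermination()
```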
Is there a more efficient way to write to multiple sinks in the same pipeline?
Currently, I am doing readStream once and then writeStream.format("").start() twice.
You actually create two separate streaming queries. The load part merely describes the first (and only) streaming source; execution-wise it does nothing.
When doing so it seems that Spark reads the data twice from the S3 source, once per sink.
That is exactly how Spark Structured Streaming's queries work: the number of sinks corresponds to the number of queries, because one streaming query can have exactly one streaming sink (see StreamExecution, which sits behind any streaming query).
You can also check the number of threads (using jconsole or similar), as Structured Streaming uses one microBatchThread thread per streaming query (see StreamExecution).
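You can also see the two queries directly through the StreamingQueryManager (continuing the sketch above, with spark and both started queries in scope):

```scala
// Each call to start() registered an independent StreamingQuery with
// the session's StreamingQueryManager, so two entries show up here.
spark.streams.active.foreach { q =>
  println(s"id=${q.id} name=${q.name} isActive=${q.isActive}")
}
```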
Is there a more efficient way to write to multiple sinks in the same pipeline?
It is not possible in the current design of Spark Structured Streaming.
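That said, Spark 2.4 and later (newer than the design this answer describes) added DataStreamWriter.foreachBatch, which lets a single streaming query, and hence a single read of the source, write each micro-batch to multiple sinks using the batch writer API. A sketch, reusing the hypothetical transformed Dataset from above:

```scala
import org.apache.spark.sql.DataFrame

// One streaming query: the source is read once per micro-batch, and the
// batch is fanned out to both sinks via the batch DataFrameWriter API.
val query = transformed.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.persist()  // cache so the batch is not recomputed per sink
    batch.write.mode("append").parquet("s3a://bucket/output/")   // hypothetical path
    batch.write.mode("append").format("es").save("index/type")   // hypothetical index
    batch.unpersist()
  }
  .option("checkpointLocation", "s3a://bucket/checkpoints/multi/")
  .start()
```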