
Executing separate streaming queries in Spark Structured Streaming

I am trying to aggregate a stream with two different windows and print the results to the console. However, only the first streaming query is printed; tenSecsQ never produces output in the console.

SparkSession spark = SparkSession
    .builder()
    .appName("JavaStructuredNetworkWordCountWindowed")
    .config("spark.master", "local[*]")
    .getOrCreate();

Dataset<Row> lines = spark
    .readStream()
    .format("socket")
    .option("host", host)
    .option("port", port)
    .option("includeTimestamp", true)
    .load();

Dataset<Row> words = lines
    .as(Encoders.tuple(Encoders.STRING(), Encoders.TIMESTAMP()))
    .toDF("word", "timestamp");

// 5 second window
Dataset<Row> fiveSecs = words
    .groupBy(
         functions.window(words.col("timestamp"), "5 seconds"),
         words.col("word")
    ).count().orderBy("window");

// 10 second window
Dataset<Row> tenSecs = words
    .groupBy(
          functions.window(words.col("timestamp"), "10 seconds"),
          words.col("word")
    ).count().orderBy("window");

I start a streaming query for each of the 5s and 10s aggregated streams. Only the 5s stream's output is printed to the console; the 10s stream's output never appears.

// Start writeStream() for 5s window
StreamingQuery fiveSecQ = fiveSecs.writeStream()
    .queryName("5_secs")
    .outputMode("complete")
    .format("console")
    .option("truncate", "false")
    .start();

// Start writeStream() for 10s window
StreamingQuery tenSecsQ = tenSecs.writeStream()
    .queryName("10_secs")
    .outputMode("complete")
    .format("console")
    .option("truncate", "false")
    .start();

tenSecsQ.awaitTermination();
asked Aug 10 '17 by atom


1 Answer

I've been investigating this question.

Summary: Each query in Structured Streaming consumes the source data independently. The socket source opens a new connection for every query defined, and the behavior seen here occurs because nc delivers the input data to only the first connection.

Hence, it's not possible to define multiple aggregations over the socket source unless we can ensure that the server behind it delivers the same data to each open connection.
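
In other words, a workaround would require a TCP server that, unlike nc, broadcasts every input line to all connected clients, so that each query receives the same data. Below is a minimal sketch of such a broadcasting server; it is not part of the original answer, and the name BroadcastServer and all implementation details are assumptions for illustration:

import java.io.PrintWriter
import java.net.ServerSocket
import scala.collection.mutable
import scala.io.StdIn

object BroadcastServer {
  def main(args: Array[String]): Unit = {
    val clients = mutable.Set.empty[PrintWriter]
    val server  = new ServerSocket(9999)

    // Accept connections in the background, keeping one writer per client.
    new Thread(() => {
      while (true) {
        val socket = server.accept()
        clients.synchronized {
          clients += new PrintWriter(socket.getOutputStream, true)
        }
      }
    }).start()

    // Forward every stdin line to all connected clients (nc sends it to only one).
    Iterator.continually(StdIn.readLine()).takeWhile(_ != null).foreach { line =>
      clients.synchronized { clients.foreach(_.println(line)) }
    }
  }
}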


I discussed this question on the Spark mailing list. Databricks developer Shixiong Zhu answered:

Spark creates one connection for each query. The behavior you observed is because how "nc -lk" works. If you use netstat to check the tcp connections, you will see there are two connections when starting two queries. However, "nc" forwards the input to only one connection.

I verified this behavior with a small experiment: I created a SimpleTCPWordServer that delivers random words to each open connection, and a basic Structured Streaming job that declares two queries (a sketch of the server appears after the streaming job below). The only difference between the queries is that the second one adds an extra constant column to differentiate its output:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.streaming.Trigger

val lines = spark
    .readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", "9999")
    .option("includeTimestamp", true)
    .load()

val q1 = lines.writeStream
  .outputMode("append")
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()

val q2 = lines.withColumn("foo", lit("foo")).writeStream
  .outputMode("append")
  .format("console")
  .trigger(Trigger.ProcessingTime("7 seconds"))
  .start()
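
For reference, this is roughly what the word server could look like. Only the name SimpleTCPWordServer comes from the answer; the implementation below is a hedged sketch (the word list, port, and timing are assumptions):

import java.io.PrintWriter
import java.net.ServerSocket
import scala.util.Random

object SimpleTCPWordServer {
  private val words = Vector("agenda", "amazing", "anticipate", "asset",
    "bear", "belong", "bottle", "breath", "calculate", "ceiling",
    "cell", "champion")

  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(9999)
    while (true) {
      val socket = server.accept()
      // One thread per client: each connection gets its own random stream,
      // which is exactly why the two queries report different words.
      new Thread(() => {
        val out = new PrintWriter(socket.getOutputStream, true)
        while (!socket.isClosed) {
          out.println(words(Random.nextInt(words.length)))
          Thread.sleep(500)
        }
      }).start()
    }
  }
}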

If Structured Streaming consumed only one stream, we would see the same words delivered to both queries. If instead each query consumes a separate stream, each query will report different words.

This is the observed output:

-------------------------------------------
Batch: 0
-------------------------------------------
+--------+-------------------+
|   value|          timestamp|
+--------+-------------------+
|champion|2017-08-14 13:54:51|
+--------+-------------------+

+------+-------------------+---+
| value|          timestamp|foo|
+------+-------------------+---+
|belong|2017-08-14 13:54:51|foo|
+------+-------------------+---+

-------------------------------------------
Batch: 1
-------------------------------------------
+-------+-------------------+---+
|  value|          timestamp|foo|
+-------+-------------------+---+
| agenda|2017-08-14 13:54:52|foo|
|ceiling|2017-08-14 13:54:52|foo|
|   bear|2017-08-14 13:54:53|foo|
+-------+-------------------+---+

-------------------------------------------
Batch: 1
-------------------------------------------
+----------+-------------------+
|     value|          timestamp|
+----------+-------------------+
|    breath|2017-08-14 13:54:52|
|anticipate|2017-08-14 13:54:52|
|   amazing|2017-08-14 13:54:52|
|    bottle|2017-08-14 13:54:53|
| calculate|2017-08-14 13:54:53|
|     asset|2017-08-14 13:54:54|
|      cell|2017-08-14 13:54:54|
+----------+-------------------+

We can clearly see that the streams for each query are different. It appears that it's not possible to define multiple aggregations over the data delivered by the socket source unless we can guarantee that the TCP backend server delivers exactly the same data to each open connection.
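
One way to obtain that guarantee is to swap the socket source for a replayable one such as Kafka, where every query independently reads the full topic. This is not part of the original answer; the sketch below assumes a topic named words, a broker at localhost:9092, and the spark-sql-kafka connector on the classpath:

import org.apache.spark.sql.functions.{col, window}

// Each query created from this DataFrame gets its own consumer and
// therefore sees the same, complete topic data.
val lines = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "words")
    .load()
    .selectExpr("CAST(value AS STRING) AS word", "timestamp")

// Both windowed aggregations now observe identical input.
val fiveSecs = lines.groupBy(window(col("timestamp"), "5 seconds"), col("word")).count()
val tenSecs  = lines.groupBy(window(col("timestamp"), "10 seconds"), col("word")).count()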

answered Oct 30 '22 by maasg