Why does Complete output mode require aggregation?

Tags: apache-spark, spark-structured-streaming
I am working with Structured Streaming in Apache Spark 2.2 and got the following exception:

org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets;;

Why does Complete output mode require a streaming aggregation? What would happen if Spark allowed Complete output mode with no aggregations in a streaming query?

scala> spark.version
res0: String = 2.2.0

import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.SQLContext
implicit val sqlContext: SQLContext = spark.sqlContext
val source = MemoryStream[(Int, Int)]
val ids = source.toDS.toDF("time", "id").
  withColumn("time", $"time" cast "timestamp"). // <-- convert time column from Int to Timestamp
  dropDuplicates("id").
  withColumn("time", $"time" cast "long")  // <-- convert time column from Timestamp to Long

import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._
scala> val q = ids.
     |   writeStream.
     |   format("memory").
     |   queryName("dups").
     |   outputMode(OutputMode.Complete).  // <-- memory sink supports checkpointing for Complete output mode only
     |   trigger(Trigger.ProcessingTime(30.seconds)).
     |   option("checkpointLocation", "checkpoint-dir"). // <-- use checkpointing to save state between restarts
     |   start
org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets;;
Project [cast(time#10 as bigint) AS time#15L, id#6]
+- Deduplicate [id#6], true
   +- Project [cast(time#5 as timestamp) AS time#10, id#6]
      +- Project [_1#2 AS time#5, _2#3 AS id#6]
         +- StreamingExecutionRelation MemoryStream[_1#2,_2#3], [_1#2, _2#3]

  at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
  at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:115)
  at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
  at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
  at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:247)
  ... 57 elided


asked Aug 18 '17 12:08

Jacek Laskowski

2 Answers

From the Structured Streaming Programming Guide, on other queries (excluding aggregations, mapGroupsWithState and flatMapGroupsWithState):

Complete mode not supported as it is infeasible to keep all unaggregated data in the Result Table.

To answer the question:

What would happen if Spark allowed Complete output mode with no aggregations in a streaming query?

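Probably an OutOfMemoryError (OOM): without an aggregation to collapse rows, the Result Table would have to retain every input row for the lifetime of the query.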
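To make the growth concrete, here is a toy model in plain Scala. It is not Spark code and does not reflect Spark internals; `ResultTableModel`, `unaggregated` and `aggregated` are hypothetical names invented for this illustration. It only compares how many rows a Complete-mode Result Table would have to hold after each micro-batch, with and without a per-id aggregation.

```scala
// Toy model only: hypothetical names, not Spark internals.
object ResultTableModel {
  type Row = (Int, Int) // (time, id), as in the question

  // Without an aggregation, Complete mode would have to keep every row seen so far.
  // Returns the table size after each micro-batch.
  def unaggregated(batches: Seq[Seq[Row]]): Seq[Int] =
    batches.scanLeft(Vector.empty[Row])(_ ++ _).tail.map(_.size)

  // With a per-id count, the table holds one row per distinct id.
  // Returns the table size after each micro-batch.
  def aggregated(batches: Seq[Seq[Row]]): Seq[Int] =
    batches
      .scanLeft(Map.empty[Int, Long]) { (counts, batch) =>
        batch.foldLeft(counts) { case (acc, (_, id)) =>
          acc.updated(id, acc.getOrElse(id, 0L) + 1L)
        }
      }
      .tail
      .map(_.size)
}

// Three micro-batches of 1000 rows each, cycling over 10 distinct ids.
val batches = (0 until 3).map(b => (0 until 1000).map(i => (b * 1000 + i, i % 10)))

println(ResultTableModel.unaggregated(batches)) // sizes 1000, 2000, 3000
println(ResultTableModel.aggregated(batches))   // sizes 10, 10, 10
```

In this model the unaggregated table grows with the total input, while the aggregated one is bounded by the number of distinct keys, which is the property the guide's statement relies on.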

The puzzling part is why dropDuplicates("id") is not treated as an aggregation.
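For comparison, a query with an explicit streaming aggregation is accepted in Complete mode. The sketch below is meant for spark-shell, where `spark` is already in scope. It is not equivalent to dropDuplicates("id"): it keeps the minimum time per id rather than the first row to arrive, and that substitution is an assumption made purely for illustration. The names `firstSeen` and `first_seen` are also made up here.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.functions.min
import org.apache.spark.sql.streaming.OutputMode
import spark.implicits._ // already imported in spark-shell; provides the tuple encoder

implicit val sqlContext: SQLContext = spark.sqlContext
val source = MemoryStream[(Int, Int)]

// Assumption: min(time) per id stands in for "one row per id".
val firstSeen = source.toDS.toDF("time", "id").
  groupBy("id").
  agg(min("time") as "time")

// Complete mode passes the check because the plan contains a streaming aggregation.
val q = firstSeen.
  writeStream.
  format("memory").
  queryName("first_seen").
  outputMode(OutputMode.Complete).
  start

source.addData((1, 1), (2, 1), (3, 2))
q.processAllAvailable() // block until the added rows have been processed
spark.table("first_seen").show()
q.stop()
```

The state kept for this query is one row per distinct id, so the Result Table stays bounded as long as the set of ids does.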


answered Nov 13 '22 08:11

Alper t. Turker

I think the problem is the output mode. Instead of using OutputMode.Complete, use OutputMode.Append as shown below.

scala> val q = ids.
     |   writeStream.
     |   format("memory").
     |   queryName("dups").
     |   outputMode(OutputMode.Append).
     |   trigger(Trigger.ProcessingTime(30.seconds)).
     |   option("checkpointLocation", "checkpoint-dir").
     |   start

answered Nov 13 '22 08:11

Thomas Okonkwo
