I have some code that joins two streaming DataFrames and outputs the result to the console.
val dataFrame1 =
  df1Input.withWatermark("timestamp", "40 seconds").as("A")
val dataFrame2 =
  df2Input.withWatermark("timestamp", "40 seconds").as("B")

val finalDF: DataFrame = dataFrame1.join(dataFrame2,
  expr(
    "A.id = B.id" +
      " AND " +
      "B.timestamp >= A.timestamp" +
      " AND " +
      "B.timestamp <= A.timestamp + interval 1 hour"),
  joinType = "leftOuter")

finalDF.writeStream.format("console").start().awaitTermination()
What I now want is to refactor this part to use Datasets, so I can have some compile-time checking.
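(For reference, A and B are case classes along these lines; this is a simplified sketch, since only the id and timestamp fields matter for the join:)

// Simplified: the real classes may carry more fields, but the join
// condition only needs id and timestamp.
case class A(id: String, timestamp: java.sql.Timestamp)
case class B(id: String, timestamp: java.sql.Timestamp)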
So what I tried was pretty straightforward:
val finalDS: Dataset[(A, B)] = dataFrame1.as[A].joinWith(dataFrame2.as[B],
  expr(
    "A.id = B.id" +
      " AND " +
      "B.timestamp >= A.timestamp" +
      " AND " +
      "B.timestamp <= A.timestamp + interval 1 hour"),
  joinType = "leftOuter")

finalDS.writeStream.format("console").start().awaitTermination()
However, this gives the following error:
org.apache.spark.sql.AnalysisException: Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition;;
As you can see, the join code hasn't changed, so there is still a watermark on both sides and a range condition. The only change was to use the Dataset API instead of the DataFrame API.

Also, it works fine when I use an inner join:
val finalDS: Dataset[(A, B)] = dataFrame1.as[A].joinWith(dataFrame2.as[B],
  expr(
    "A.id = B.id" +
      " AND " +
      "B.timestamp >= A.timestamp" +
      " AND " +
      "B.timestamp <= A.timestamp + interval 1 hour"))

finalDS.writeStream.format("console").start().awaitTermination()
Does anyone know why this happens?
Spark Streaming (the older DStream API) receives real-time data and divides it into small batches for the execution engine, whereas Structured Streaming is built on top of the Spark SQL API. In the end, all of these APIs are optimized by the Catalyst optimizer and translated into RDDs for execution under the hood.

Apache Spark 2.0 added the first version of this newer, higher-level API, Structured Streaming, for building continuous applications. The main goal is to make it easier to build end-to-end streaming applications that integrate with storage, serving systems, and batch jobs in a consistent and fault-tolerant way.

In Structured Streaming, readStream can monitor a folder and process files as they arrive in the directory in real time, while writeStream writes out the resulting DataFrame or Dataset. The engine is scalable, high-throughput, and fault-tolerant, and it supports both batch and streaming workloads.

Exactly-once semantics are only possible if the source is replayable and the sink is idempotent.
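For example, a minimal sketch of such a pipeline, assuming hypothetical paths and a two-column schema, could look like this:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("file-stream-example").getOrCreate()

// Streaming file sources need an explicit schema up front.
val schema = new StructType()
  .add("id", StringType)
  .add("timestamp", TimestampType)

val input = spark.readStream
  .schema(schema)
  .json("/data/incoming")  // hypothetical directory monitored for new files

// A replayable source (files) plus an idempotent sink (files) and a
// checkpoint location are what make exactly-once delivery possible.
input.writeStream
  .format("parquet")
  .option("checkpointLocation", "/data/checkpoints")
  .option("path", "/data/output")
  .start()
  .awaitTermination()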
Well, when you use the joinWith method instead of join, you rely on a different implementation, and it seems that this implementation does not support leftOuter joins for streaming Datasets.

You can check the outer joins with watermarking section of the official documentation: the join method, not joinWith, is used there. Note that the result type will be a DataFrame, which means you will most likely have to map the fields manually:
val finalDS = dataFrame1.as[A].join(dataFrame2.as[B],
  expr(
    "A.id = B.id" +
      " AND " +
      "B.timestamp >= A.timestamp" +
      " AND " +
      "B.timestamp <= A.timestamp + interval 1 hour"),
  joinType = "leftOuter").select(/* useful fields */).as[C]