I am trying to get the latest records from a table using a self join. It works with spark-sql but not with the Spark DataFrame API.
Can anyone help? Is it a bug?
I am using Spark 2.2.0 in local mode.
Creating the input DataFrame:
scala> val df3 = spark.sparkContext.parallelize(Array((1,"a",1),(1,"aa",2),(2,"b",2),(2,"bb",5))).toDF("id","value","time")
df3: org.apache.spark.sql.DataFrame = [id: int, value: string ... 1 more field]
scala> val df33 = df3
df33: org.apache.spark.sql.DataFrame = [id: int, value: string ... 1 more field]
scala> df3.show
+---+-----+----+
| id|value|time|
+---+-----+----+
| 1| a| 1|
| 1| aa| 2|
| 2| b| 2|
| 2| bb| 5|
+---+-----+----+
scala> df33.show
+---+-----+----+
| id|value|time|
+---+-----+----+
| 1| a| 1|
| 1| aa| 2|
| 2| b| 2|
| 2| bb| 5|
+---+-----+----+
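(For the SQL query below to resolve df3 and df33 as table names, the DataFrames presumably were registered as temporary views first; that step isn't shown in the transcript, e.g.:)
scala> // assumed step, not shown in the original session, but required for the SQL below
scala> df3.createOrReplaceTempView("df3")
scala> df33.createOrReplaceTempView("df33")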
Now performing the join using SQL: works
scala> spark.sql("select df33.* from df3 join df33 on df3.id = df33.id and df3.time < df33.time").show
+---+-----+----+
| id|value|time|
+---+-----+----+
| 1| aa| 2|
| 2| bb| 5|
+---+-----+----+
Now performing the join using the DataFrame API: doesn't work
scala> df3.join(df33, (df3.col("id") === df33.col("id")) && (df3.col("time") < df33.col("time")) ).select(df33.col("id"),df33.col("value"),df33.col("time")).show
+---+-----+----+
| id|value|time|
+---+-----+----+
+---+-----+----+
The thing to notice is the explain plans: the DataFrame API one collapses to an empty LocalTableScan!
scala> df3.join(df33, (df3.col("id") === df33.col("id")) && (df3.col("time") < df33.col("time")) ).select(df33.col("id"),df33.col("value"),df33.col("time")).explain
== Physical Plan ==
LocalTableScan <empty>, [id#150, value#151, time#152]
scala> spark.sql("select df33.* from df3 join df33 on df3.id = df33.id and df3.time < df33.time").explain
== Physical Plan ==
*Project [id#1241, value#1242, time#1243]
+- *SortMergeJoin [id#150], [id#1241], Inner, (time#152 < time#1243)
:- *Sort [id#150 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#150, 200)
: +- *Project [_1#146 AS id#150, _3#148 AS time#152]
: +- *SerializeFromObject [assertnotnull(input[0, scala.Tuple3, true])._1 AS _1#146, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString,
assertnotnull(input[0, scala.Tuple3, true])._2, true) AS _2#147, assertnotnull(input[0, scala.Tuple3, true])._3 AS _3#148]
: +- Scan ExternalRDDScan[obj#145]
+- *Sort [id#1241 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#1241, 200)
+- *Project [_1#146 AS id#1241, _2#147 AS value#1242, _3#148 AS time#1243]
+- *SerializeFromObject [assertnotnull(input[0, scala.Tuple3, true])._1 AS _1#146, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString,
assertnotnull(input[0, scala.Tuple3, true])._2, true) AS _2#147, assertnotnull(input[0, scala.Tuple3, true])._3 AS _3#148]
+- Scan ExternalRDDScan[obj#145]
No, that's not a bug. When you reassign the DataFrame to a new name like that, it shares the same lineage and doesn't duplicate the data, so df3 and df33 resolve to exactly the same columns and you end up comparing each column with itself.
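A minimal sketch of the effect, using the DataFrames from the question:
// df33 is literally the same DataFrame object, just bound to a second name,
// so both Column objects in the condition resolve to the same attributes
df3 eq df33   // Boolean = true
// the condition is effectively `time#152 < time#152`, which can never hold,
// hence the empty LocalTableScan in the physical plan above
df3.join(df33, df3.col("id") === df33.col("id") && df3.col("time") < df33.col("time")).count   // 0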
Using spark.sql is slightly different because it is actually working on aliases (the registered views) of your DataFrames.
So the correct way to perform a self-join with the DataFrame API is to alias your DataFrame as well, as follows:
// in spark-shell the implicits are already in scope; otherwise you also need:
// import spark.implicits._
val df1 = Seq((1,"a",1),(1,"aa",2),(2,"b",2),(2,"bb",5)).toDF("id","value","time")
df1.as("df1").join(df1.as("df2"), $"df1.id" === $"df2.id" && $"df1.time" < $"df2.time").select($"df2.*").show
// +---+-----+----+
// | id|value|time|
// +---+-----+----+
// | 1| aa| 2|
// | 2| bb| 5|
// +---+-----+----+
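If all you actually need is the latest row per id, a window function gives the same result without a self-join; a quick sketch using the same df1:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, desc, col}

val w = Window.partitionBy("id").orderBy(desc("time"))
df1.withColumn("rn", row_number().over(w)).where(col("rn") === 1).drop("rn").show
// yields the same two rows as above: (1, aa, 2) and (2, bb, 5)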
For more information about self-joins, I recommend reading High Performance Spark by Holden Karau and Rachel Warren, Chapter 4.