How to join Datasets on multiple columns?

Given two Spark Datasets, A and B, I can join on a single column as follows:

a.joinWith(b, $"a.col" === $"b.col", "left")

My question is whether you can do a join using multiple columns. Essentially, the equivalent of the following DataFrame API code:

a.join(b, a("col") === b("col") && a("col2") === b("col2"), "left")

asked Jun 16 '16 06:06

3 Answers

You can do it exactly the same way as with a DataFrame:

val xs = Seq(("a", "foo", 2.0), ("x", "bar", -1.0)).toDS
val ys = Seq(("a", "foo", 2.0), ("y", "bar", 1.0)).toDS

xs.joinWith(ys, xs("_1") === ys("_1") && xs("_2") === ys("_2"), "left").show
// +------------+-----------+
// |          _1|         _2|
// +------------+-----------+
// | [a,foo,2.0]|[a,foo,2.0]|
// |[x,bar,-1.0]|       null|
// +------------+-----------+

In Spark < 2.0.0 you can use something like this:

xs.as("xs").joinWith(
  ys.as("ys"), ($"xs._1" === $"ys._1") && ($"xs._2" === $"ys._2"), "left")
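If the key columns are only known at run time, the same condition can be built by combining one equality test per column. This is a minimal sketch, assuming Spark 2.x or later and a local SparkSession; the object name MultiColumnJoin and the helper equiJoinCondition are hypothetical names used for illustration, not part of the Spark API.

```scala
import org.apache.spark.sql.{Column, Dataset, SparkSession}

object MultiColumnJoin {
  // Hypothetical helper: ANDs together one equality test per key column.
  def equiJoinCondition(left: Dataset[_], right: Dataset[_], keys: Seq[String]): Column = {
    require(keys.nonEmpty, "at least one key column is required")
    keys.map(k => left(k) === right(k)).reduce(_ && _)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for a quick experiment.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("multi-column-join")
      .getOrCreate()
    import spark.implicits._

    val xs = Seq(("a", "foo", 2.0), ("x", "bar", -1.0)).toDS
    val ys = Seq(("a", "foo", 2.0), ("y", "bar", 1.0)).toDS

    // Same join as above, with the condition generated from a list of keys.
    val joined = xs.joinWith(ys, equiJoinCondition(xs, ys, Seq("_1", "_2")), "left")
    joined.show()

    spark.stop()
  }
}
```

The helper only produces a Column, so it can be passed to join as well as joinWith.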

answered Oct 04 '22 19:10

zero323

There's another way of joining: chaining where operators one after another. You first specify a join (and optionally its type), followed by one or more where operators, e.g.

scala> case class A(id: Long, name: String)
defined class A

scala> case class B(id: Long, name: String)
defined class B

scala> val as = Seq(A(0, "zero"), A(1, "one")).toDS
as: org.apache.spark.sql.Dataset[A] = [id: bigint, name: string]

scala> val bs = Seq(B(0, "zero"), B(1, "jeden")).toDS
bs: org.apache.spark.sql.Dataset[B] = [id: bigint, name: string]

scala> as.join(bs).where(as("id") === bs("id")).show
+---+----+---+-----+
| id|name| id| name|
+---+----+---+-----+
|  0|zero|  0| zero|
|  1| one|  1|jeden|
+---+----+---+-----+


scala> as.join(bs).where(as("id") === bs("id")).where(as("name") === bs("name")).show
+---+----+---+----+
| id|name| id|name|
+---+----+---+----+
|  0|zero|  0|zero|
+---+----+---+----+

The reason this works is that the Spark optimizer will join (no pun intended) consecutive where operators into a single condition and push it into the join. Use the explain operator to see the underlying logical and physical plans.

scala> as.join(bs).where(as("id") === bs("id")).where(as("name") === bs("name")).explain(extended = true)
== Parsed Logical Plan ==
Filter (name#31 = name#36)
+- Filter (id#30L = id#35L)
   +- Join Inner
      :- LocalRelation [id#30L, name#31]
      +- LocalRelation [id#35L, name#36]

== Analyzed Logical Plan ==
id: bigint, name: string, id: bigint, name: string
Filter (name#31 = name#36)
+- Filter (id#30L = id#35L)
   +- Join Inner
      :- LocalRelation [id#30L, name#31]
      +- LocalRelation [id#35L, name#36]

== Optimized Logical Plan ==
Join Inner, ((name#31 = name#36) && (id#30L = id#35L))
:- Filter isnotnull(name#31)
:  +- LocalRelation [id#30L, name#31]
+- Filter isnotnull(name#36)
   +- LocalRelation [id#35L, name#36]

== Physical Plan ==
*BroadcastHashJoin [name#31, id#30L], [name#36, id#35L], Inner, BuildRight
:- *Filter isnotnull(name#31)
:  +- LocalTableScan [id#30L, name#31]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, false], input[0, bigint, false]))
   +- *Filter isnotnull(name#36)
      +- LocalTableScan [id#35L, name#36]
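To check that the chained form and a single combined condition return the same rows, the two can be compared directly. This is a rough sketch, assuming Spark 2.x or later with a local SparkSession; the object name WhereChainingCheck and the helpers chainedJoin and singleConditionJoin are made up for this example.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object WhereChainingCheck {
  case class A(id: Long, name: String)
  case class B(id: Long, name: String)

  // Join first, then filter with one where per column (the style shown above).
  def chainedJoin(as: Dataset[A], bs: Dataset[B]): DataFrame =
    as.join(bs).where(as("id") === bs("id")).where(as("name") === bs("name"))

  // The same inner join expressed with one combined condition.
  def singleConditionJoin(as: Dataset[A], bs: Dataset[B]): DataFrame =
    as.join(bs, as("id") === bs("id") && as("name") === bs("name"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for a quick experiment.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("where-chaining-check")
      .getOrCreate()
    import spark.implicits._

    val as = Seq(A(0, "zero"), A(1, "one")).toDS
    val bs = Seq(B(0, "zero"), B(1, "jeden")).toDS

    val chained = chainedJoin(as, bs).collect().toSet
    val single = singleConditionJoin(as, bs).collect().toSet
    println(s"same rows: ${chained == single}")

    spark.stop()
  }
}
```

Calling explain on both DataFrames is a quick way to compare their optimized plans as well.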

answered Oct 04 '22 20:10

Jacek Laskowski

In Java, the && operator does not work on Column objects. To join on multiple columns with the Spark Java API, combine the conditions with the and method instead:

Dataset<Row> datasetRf1 = joinedWithDays.join(
        datasetFreq,
        datasetFreq.col("userId").equalTo(joinedWithDays.col("userId"))
                .and(datasetFreq.col("artistId").equalTo(joinedWithDays.col("artistId"))),
        "inner");

The and method of Column works like the && operator in Scala.

answered Oct 04 '22 19:10

ForeverLearner

Tags: scala, apache-spark, apache-spark-sql

d80tb7
