I looked at the docs, and they say the following join types are supported:
Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.
I looked at the StackOverflow answers on SQL joins, and the top couple of answers do not mention some of the joins from above, e.g. left_semi and left_anti. What do they mean in Spark?
Here is a simple illustrative experiment:
import org.apache.spark.sql._

object SparkSandbox extends App {
  implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._
  spark.sparkContext.setLogLevel("ERROR")

  val left = Seq((1, "A1"), (2, "A2"), (3, "A3"), (4, "A4")).toDF("id", "value")
  val right = Seq((3, "A3"), (4, "A4"), (4, "A4_1"), (5, "A5"), (6, "A6")).toDF("id", "value")

  println("LEFT")
  left.orderBy("id").show()

  println("RIGHT")
  right.orderBy("id").show()

  val joinTypes = Seq("inner", "outer", "full", "full_outer", "left", "left_outer", "right", "right_outer", "left_semi", "left_anti")

  joinTypes foreach { joinType =>
    println(s"${joinType.toUpperCase()} JOIN")
    left.join(right = right, usingColumns = Seq("id"), joinType = joinType).orderBy("id").show()
  }
}
Output
LEFT
+---+-----+
| id|value|
+---+-----+
|  1|   A1|
|  2|   A2|
|  3|   A3|
|  4|   A4|
+---+-----+

RIGHT
+---+-----+
| id|value|
+---+-----+
|  3|   A3|
|  4|   A4|
|  4| A4_1|
|  5|   A5|
|  6|   A6|
+---+-----+

INNER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  3|   A3|   A3|
|  4|   A4| A4_1|
|  4|   A4|   A4|
+---+-----+-----+

OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4|   A4|
|  4|   A4| A4_1|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

FULL JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4|   A4|
|  4|   A4| A4_1|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

FULL_OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4|   A4|
|  4|   A4| A4_1|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

LEFT JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4| A4_1|
|  4|   A4|   A4|
+---+-----+-----+

LEFT_OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4| A4_1|
|  4|   A4|   A4|
+---+-----+-----+

RIGHT JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  3|   A3|   A3|
|  4|   A4| A4_1|
|  4|   A4|   A4|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

RIGHT_OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  3|   A3|   A3|
|  4|   A4|   A4|
|  4|   A4| A4_1|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

LEFT_SEMI JOIN
+---+-----+
| id|value|
+---+-----+
|  3|   A3|
|  4|   A4|
+---+-----+

LEFT_ANTI JOIN
+---+-----+
| id|value|
+---+-----+
|  1|   A1|
|  2|   A2|
+---+-----+
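So what do left_semi and left_anti mean? They are filtering joins rather than combining joins: left_semi keeps only the left rows that have at least one match on the right (like SQL's WHERE EXISTS), never duplicates a left row, and returns no columns from the right; left_anti is its complement and keeps only the left rows with no match (like WHERE NOT EXISTS). That is why those two result tables above have only the left schema. As a minimal sketch, here is one way to reproduce both with the standard join types, reusing the left and right frames from the experiment above (the matched flag column is just an illustrative name, not part of any API):

import org.apache.spark.sql.functions.{col, lit}

// left_semi by hand: inner join against the distinct right keys, so a
// left row appears at most once and no right columns are carried over.
val semiByHand = left.join(right.select("id").distinct(), Seq("id"), "inner")
semiByHand.orderBy("id").show() // same rows as LEFT_SEMI above

// left_anti by hand: left outer join against the flagged right keys,
// then keep only the rows where the flag stayed null (i.e. no match).
val rightIds = right.select("id").distinct().withColumn("matched", lit(true))
val antiByHand = left.join(rightIds, Seq("id"), "left_outer")
  .where(col("matched").isNull)
  .drop("matched")
antiByHand.orderBy("id").show() // same rows as LEFT_ANTI above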
Loved Pathikrit's example. Here is a possible translation in Java using Spark v2 and DataFrames, including the cross join.
package net.jgp.books.sparkInAction.ch12.lab940AllJoins;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

/**
 * All joins in a single app, inspired by
 * https://stackoverflow.com/questions/45990633/what-are-the-various-join-types-in-spark.
 *
 * Used in Spark in Action 2e, http://jgp.net/sia
 *
 * @author jgp
 */
public class AllJoinsApp {

  /**
   * main() is your entry point to the application.
   *
   * @param args
   */
  public static void main(String[] args) {
    AllJoinsApp app = new AllJoinsApp();
    app.start();
  }

  /**
   * The processing code.
   */
  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("Processing of invoices")
        .master("local")
        .getOrCreate();

    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("id", DataTypes.IntegerType, false),
        DataTypes.createStructField("value", DataTypes.StringType, false) });

    List<Row> rows = new ArrayList<Row>();
    rows.add(RowFactory.create(1, "A1"));
    rows.add(RowFactory.create(2, "A2"));
    rows.add(RowFactory.create(3, "A3"));
    rows.add(RowFactory.create(4, "A4"));
    Dataset<Row> dfLeft = spark.createDataFrame(rows, schema);
    dfLeft.show();

    rows = new ArrayList<Row>();
    rows.add(RowFactory.create(3, "A3"));
    rows.add(RowFactory.create(4, "A4"));
    rows.add(RowFactory.create(4, "A4_1"));
    rows.add(RowFactory.create(5, "A5"));
    rows.add(RowFactory.create(6, "A6"));
    Dataset<Row> dfRight = spark.createDataFrame(rows, schema);
    dfRight.show();

    String[] joinTypes = new String[] {
        "inner", // v2.0.0, default
        "cross", // v2.2.0
        "outer", // v2.0.0
        "full", // v2.1.1
        "full_outer", // v2.1.1
        "left", // v2.1.1
        "left_outer", // v2.0.0
        "right", // v2.1.1
        "right_outer", // v2.0.0
        "left_semi", // v2.0.0, was leftsemi before v2.1.1
        "left_anti" // v2.1.1
    };

    for (String joinType : joinTypes) {
      System.out.println(joinType.toUpperCase() + " JOIN");
      Dataset<Row> df = dfLeft.join(
          dfRight,
          dfLeft.col("id").equalTo(dfRight.col("id")),
          joinType);
      df.orderBy(dfLeft.col("id")).show();
    }
  }
}
I'll put this example in the Spark in Action, 2e's chapter 12 repository.
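One thing Pathikrit's run does not show is the cross join, which simply returns the Cartesian product of the two sides. Besides the "cross" join type used above, Spark (since 2.1) also offers an explicit crossJoin method on Dataset; as a minimal sketch against the same two frames (in Scala, to match the first answer):

// Cross join: every left row is paired with every right row,
// so 4 left rows x 5 right rows = 20 result rows.
val cartesian = left.crossJoin(right)
cartesian.show(20, false)

Note that if you produce a Cartesian product implicitly instead (a plain join with no join condition), Spark 2.x rejects the plan unless spark.sql.crossJoin.enabled is set to true; using the "cross" join type or crossJoin states the intent explicitly.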