 

What are the various join types in Spark?

I looked at the docs and it says the following join types are supported:

Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.

I looked at the StackOverflow answer on SQL joins, and the top couple of answers do not mention some of the join types listed above, e.g. left_semi and left_anti. What do they mean in Spark?

asked Aug 31 '17 by pathikrit



2 Answers

Here is a simple illustrative experiment:

import org.apache.spark.sql._

object SparkSandbox extends App {
  implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._
  spark.sparkContext.setLogLevel("ERROR")

  val left = Seq((1, "A1"), (2, "A2"), (3, "A3"), (4, "A4")).toDF("id", "value")
  val right = Seq((3, "A3"), (4, "A4"), (4, "A4_1"), (5, "A5"), (6, "A6")).toDF("id", "value")

  println("LEFT")
  left.orderBy("id").show()

  println("RIGHT")
  right.orderBy("id").show()

  val joinTypes = Seq("inner", "outer", "full", "full_outer",
    "left", "left_outer", "right", "right_outer", "left_semi", "left_anti")

  joinTypes foreach { joinType =>
    println(s"${joinType.toUpperCase()} JOIN")
    left.join(right = right, usingColumns = Seq("id"), joinType = joinType).orderBy("id").show()
  }
}

Output

LEFT
+---+-----+
| id|value|
+---+-----+
|  1|   A1|
|  2|   A2|
|  3|   A3|
|  4|   A4|
+---+-----+

RIGHT
+---+-----+
| id|value|
+---+-----+
|  3|   A3|
|  4|   A4|
|  4| A4_1|
|  5|   A5|
|  6|   A6|
+---+-----+

INNER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  3|   A3|   A3|
|  4|   A4| A4_1|
|  4|   A4|   A4|
+---+-----+-----+

OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4|   A4|
|  4|   A4| A4_1|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

FULL JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4|   A4|
|  4|   A4| A4_1|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

FULL_OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4|   A4|
|  4|   A4| A4_1|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

LEFT JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4| A4_1|
|  4|   A4|   A4|
+---+-----+-----+

LEFT_OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4| A4_1|
|  4|   A4|   A4|
+---+-----+-----+

RIGHT JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  3|   A3|   A3|
|  4|   A4| A4_1|
|  4|   A4|   A4|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

RIGHT_OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  3|   A3|   A3|
|  4|   A4|   A4|
|  4|   A4| A4_1|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

LEFT_SEMI JOIN
+---+-----+
| id|value|
+---+-----+
|  3|   A3|
|  4|   A4|
+---+-----+

LEFT_ANTI JOIN
+---+-----+
| id|value|
+---+-----+
|  1|   A1|
|  2|   A2|
+---+-----+
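As the output shows, left_semi keeps only the left rows that have a matching id in the right table (and never pulls in any right columns, which is also why id 4 appears once rather than twice), while left_anti keeps only the left rows with no match. They behave like SQL EXISTS / NOT EXISTS filters rather than true joins. A minimal sketch of that filtering semantics using plain Java collections (not Spark, just to illustrate the behaviour on the same data):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class SemiAntiSketch {
  public static void main(String[] args) {
    List<Map.Entry<Integer, String>> left = List.of(
        Map.entry(1, "A1"), Map.entry(2, "A2"), Map.entry(3, "A3"), Map.entry(4, "A4"));
    List<Map.Entry<Integer, String>> right = List.of(
        Map.entry(3, "A3"), Map.entry(4, "A4"), Map.entry(4, "A4_1"),
        Map.entry(5, "A5"), Map.entry(6, "A6"));

    Set<Integer> rightIds = right.stream().map(Map.Entry::getKey).collect(Collectors.toSet());

    // left_semi: keep left rows whose id has a match in right; no right columns appear,
    // and each left row appears at most once even if it matches several right rows
    List<Map.Entry<Integer, String>> leftSemi = left.stream()
        .filter(row -> rightIds.contains(row.getKey()))
        .collect(Collectors.toList());

    // left_anti: keep left rows whose id has no match in right
    List<Map.Entry<Integer, String>> leftAnti = left.stream()
        .filter(row -> !rightIds.contains(row.getKey()))
        .collect(Collectors.toList());

    System.out.println("left_semi ids: "
        + leftSemi.stream().map(Map.Entry::getKey).collect(Collectors.toList())); // [3, 4]
    System.out.println("left_anti ids: "
        + leftAnti.stream().map(Map.Entry::getKey).collect(Collectors.toList())); // [1, 2]
  }
}
```

This matches the LEFT_SEMI (ids 3 and 4) and LEFT_ANTI (ids 1 and 2) tables above.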
answered Sep 22 '22 by pathikrit


Loved Pathikrit's example. Here is a possible translation in Java, using Spark v2 DataFrames and including the cross join.

package net.jgp.books.sparkInAction.ch12.lab940AllJoins;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

/**
 * All joins in a single app, inspired by
 * https://stackoverflow.com/questions/45990633/what-are-the-various-join-types-in-spark.
 *
 * Used in Spark in Action 2e, http://jgp.net/sia
 *
 * @author jgp
 */
public class AllJoinsApp {

  /**
   * main() is your entry point to the application.
   *
   * @param args
   */
  public static void main(String[] args) {
    AllJoinsApp app = new AllJoinsApp();
    app.start();
  }

  /**
   * The processing code.
   */
  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("Processing of invoices")
        .master("local")
        .getOrCreate();

    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField(
            "id",
            DataTypes.IntegerType,
            false),
        DataTypes.createStructField(
            "value",
            DataTypes.StringType,
            false) });

    List<Row> rows = new ArrayList<Row>();
    rows.add(RowFactory.create(1, "A1"));
    rows.add(RowFactory.create(2, "A2"));
    rows.add(RowFactory.create(3, "A3"));
    rows.add(RowFactory.create(4, "A4"));
    Dataset<Row> dfLeft = spark.createDataFrame(rows, schema);
    dfLeft.show();

    rows = new ArrayList<Row>();
    rows.add(RowFactory.create(3, "A3"));
    rows.add(RowFactory.create(4, "A4"));
    rows.add(RowFactory.create(4, "A4_1"));
    rows.add(RowFactory.create(5, "A5"));
    rows.add(RowFactory.create(6, "A6"));
    Dataset<Row> dfRight = spark.createDataFrame(rows, schema);
    dfRight.show();

    String[] joinTypes = new String[] {
        "inner", // v2.0.0. default
        "cross", // v2.2.0
        "outer", // v2.0.0
        "full", // v2.1.1
        "full_outer", // v2.1.1
        "left", // v2.1.1
        "left_outer", // v2.0.0
        "right", // v2.1.1
        "right_outer", // v2.0.0
        "left_semi", // v2.0.0, was leftsemi before v2.1.1
        "left_anti" // v2.1.1
    };

    for (String joinType : joinTypes) {
      System.out.println(joinType.toUpperCase() + " JOIN");
      Dataset<Row> df = dfLeft.join(
          dfRight,
          dfLeft.col("id").equalTo(dfRight.col("id")),
          joinType);
      df.orderBy(dfLeft.col("id")).show();
    }
  }
}
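The "cross" type, which this Java version adds to the earlier Scala run, produces the Cartesian product: every left row is paired with every right row, regardless of the join condition's keys. With 4 left rows and 5 right rows that means 20 output rows. A small sketch of the cardinality with plain Java collections (not Spark, just to show what a cross join does):

```java
import java.util.ArrayList;
import java.util.List;

public class CrossJoinSketch {
  public static void main(String[] args) {
    // The ids from the two tables above
    List<Integer> leftIds = List.of(1, 2, 3, 4);
    List<Integer> rightIds = List.of(3, 4, 4, 5, 6);

    // A cross join pairs every left row with every right row
    List<int[]> pairs = new ArrayList<>();
    for (int l : leftIds) {
      for (int r : rightIds) {
        pairs.add(new int[] { l, r });
      }
    }

    // 4 left rows x 5 right rows = 20 pairs, whether or not the ids match
    System.out.println(pairs.size()); // prints 20
  }
}
```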

I'll put this example in the chapter 12 repository of Spark in Action, 2e.

answered Sep 25 '22 by jgp