I looked at the docs, and they say the following join types are supported:
Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.
I looked at the StackOverflow answers on SQL joins, and the top couple of answers do not mention some of the joins from above, e.g. left_semi and left_anti. What do they mean in Spark?
Here is a simple illustrative experiment:
import org.apache.spark.sql._

object SparkSandbox extends App {
  implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._
  spark.sparkContext.setLogLevel("ERROR")

  val left = Seq((1, "A1"), (2, "A2"), (3, "A3"), (4, "A4")).toDF("id", "value")
  val right = Seq((3, "A3"), (4, "A4"), (4, "A4_1"), (5, "A5"), (6, "A6")).toDF("id", "value")

  println("LEFT")
  left.orderBy("id").show()

  println("RIGHT")
  right.orderBy("id").show()

  val joinTypes = Seq("inner", "outer", "full", "full_outer", "left", "left_outer", "right", "right_outer", "left_semi", "left_anti")

  joinTypes foreach { joinType =>
    println(s"${joinType.toUpperCase()} JOIN")
    left.join(right = right, usingColumns = Seq("id"), joinType = joinType).orderBy("id").show()
  }
}
Output
LEFT
+---+-----+
| id|value|
+---+-----+
|  1|   A1|
|  2|   A2|
|  3|   A3|
|  4|   A4|
+---+-----+

RIGHT
+---+-----+
| id|value|
+---+-----+
|  3|   A3|
|  4|   A4|
|  4| A4_1|
|  5|   A5|
|  6|   A6|
+---+-----+

INNER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  3|   A3|   A3|
|  4|   A4| A4_1|
|  4|   A4|   A4|
+---+-----+-----+

OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4|   A4|
|  4|   A4| A4_1|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

FULL JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4|   A4|
|  4|   A4| A4_1|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

FULL_OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4|   A4|
|  4|   A4| A4_1|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

LEFT JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4| A4_1|
|  4|   A4|   A4|
+---+-----+-----+

LEFT_OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4| A4_1|
|  4|   A4|   A4|
+---+-----+-----+

RIGHT JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  3|   A3|   A3|
|  4|   A4| A4_1|
|  4|   A4|   A4|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

RIGHT_OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  3|   A3|   A3|
|  4|   A4|   A4|
|  4|   A4| A4_1|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

LEFT_SEMI JOIN
+---+-----+
| id|value|
+---+-----+
|  3|   A3|
|  4|   A4|
+---+-----+

LEFT_ANTI JOIN
+---+-----+
| id|value|
+---+-----+
|  1|   A1|
|  2|   A2|
+---+-----+
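So what do left_semi and left_anti mean? They are filtering joins rather than combining joins: left_semi keeps only the left rows that have at least one match on the right (like SQL's WHERE EXISTS), never duplicates a left row, and returns no columns from the right; left_anti is its complement and keeps only the left rows with no match (like WHERE NOT EXISTS). That is why those two result tables above have only the left schema. As a minimal sketch, here is one way to reproduce both with the standard join types, reusing the left and right frames from the experiment above (the matched flag column is just an illustrative name, not part of any API):

import org.apache.spark.sql.functions.{col, lit}

// left_semi by hand: inner join against the distinct right keys, so a
// left row appears at most once and no right columns are carried over.
val semiByHand = left.join(right.select("id").distinct(), Seq("id"), "inner")
semiByHand.orderBy("id").show() // same rows as LEFT_SEMI above

// left_anti by hand: left outer join against the flagged right keys,
// then keep only the rows where the flag stayed null (i.e. no match).
val rightIds = right.select("id").distinct().withColumn("matched", lit(true))
val antiByHand = left.join(rightIds, Seq("id"), "left_outer")
  .where(col("matched").isNull)
  .drop("matched")
antiByHand.orderBy("id").show() // same rows as LEFT_ANTI above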
Loved Pathikrit's example. Here is a possible translation in Java using Spark v2 and DataFrames, including the cross join.
package net.jgp.books.sparkInAction.ch12.lab940AllJoins;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

/**
 * All joins in a single app, inspired by
 * https://stackoverflow.com/questions/45990633/what-are-the-various-join-types-in-spark.
 *
 * Used in Spark in Action 2e, http://jgp.net/sia
 *
 * @author jgp
 */
public class AllJoinsApp {

  /**
   * main() is your entry point to the application.
   *
   * @param args
   */
  public static void main(String[] args) {
    AllJoinsApp app = new AllJoinsApp();
    app.start();
  }

  /**
   * The processing code.
   */
  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("Processing of invoices")
        .master("local")
        .getOrCreate();

    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("id", DataTypes.IntegerType, false),
        DataTypes.createStructField("value", DataTypes.StringType, false) });

    List<Row> rows = new ArrayList<Row>();
    rows.add(RowFactory.create(1, "A1"));
    rows.add(RowFactory.create(2, "A2"));
    rows.add(RowFactory.create(3, "A3"));
    rows.add(RowFactory.create(4, "A4"));
    Dataset<Row> dfLeft = spark.createDataFrame(rows, schema);
    dfLeft.show();

    rows = new ArrayList<Row>();
    rows.add(RowFactory.create(3, "A3"));
    rows.add(RowFactory.create(4, "A4"));
    rows.add(RowFactory.create(4, "A4_1"));
    rows.add(RowFactory.create(5, "A5"));
    rows.add(RowFactory.create(6, "A6"));
    Dataset<Row> dfRight = spark.createDataFrame(rows, schema);
    dfRight.show();

    String[] joinTypes = new String[] {
        "inner", // v2.0.0, default
        "cross", // v2.2.0
        "outer", // v2.0.0
        "full", // v2.1.1
        "full_outer", // v2.1.1
        "left", // v2.1.1
        "left_outer", // v2.0.0
        "right", // v2.1.1
        "right_outer", // v2.0.0
        "left_semi", // v2.0.0, was leftsemi before v2.1.1
        "left_anti" // v2.1.1
    };

    for (String joinType : joinTypes) {
      System.out.println(joinType.toUpperCase() + " JOIN");
      Dataset<Row> df = dfLeft.join(
          dfRight,
          dfLeft.col("id").equalTo(dfRight.col("id")),
          joinType);
      df.orderBy(dfLeft.col("id")).show();
    }
  }
}
I'll put this example in the Spark in Action, 2e's chapter 12 repository.
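One thing Pathikrit's run does not show is the cross join, which simply returns the Cartesian product of the two sides. Besides the "cross" join type used above, Spark (since 2.1) also offers an explicit crossJoin method on Dataset; as a minimal sketch against the same two frames (in Scala, to match the first answer):

// Cross join: every left row is paired with every right row,
// so 4 left rows x 5 right rows = 20 result rows.
val cartesian = left.crossJoin(right)
cartesian.show(20, false)

Note that if you produce a Cartesian product implicitly instead (a plain join with no join condition), Spark 2.x rejects the plan unless spark.sql.crossJoin.enabled is set to true; using the "cross" join type or crossJoin states the intent explicitly.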