Including null values in an Apache Spark Join

Tags:

I would like to include null values in an Apache Spark join. Spark doesn't include rows with null by default.

Here is the default Spark behavior.

val numbersDf = Seq(
  ("123"),
  ("456"),
  (null),
  ("")
).toDF("numbers")

val lettersDf = Seq(
  ("123", "abc"),
  ("456", "def"),
  (null, "zzz"),
  ("", "hhh")
).toDF("numbers", "letters")

val joinedDf = numbersDf.join(lettersDf, Seq("numbers"))

Here is the output of joinedDf.show():

+-------+-------+
|numbers|letters|
+-------+-------+
|    123|    abc|
|    456|    def|
|       |    hhh|
+-------+-------+

This is the output I would like:

+-------+-------+
|numbers|letters|
+-------+-------+
|    123|    abc|
|    456|    def|
|       |    hhh|
|   null|    zzz|
+-------+-------+

519

asked Jan 18 '17 20:01

Powers

4 Answers

Spark provides a special NULL safe equality operator:

numbersDf
  .join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
  .drop(lettersDf("numbers"))

+-------+-------+
|numbers|letters|
+-------+-------+
|    123|    abc|
|    456|    def|
|   null|    zzz|
|       |    hhh|
+-------+-------+

Be careful not to use it with Spark 1.5 or earlier. Prior to Spark 1.6 it required a Cartesian product (SPARK-11111 - Fast null-safe join).

In Spark 2.3.0 or later you can use Column.eqNullSafe in PySpark:

numbers_df = sc.parallelize([
    ("123", ), ("456", ), (None, ), ("", )
]).toDF(["numbers"])

letters_df = sc.parallelize([
    ("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")
]).toDF(["numbers", "letters"])

numbers_df.join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))

+-------+-------+-------+
|numbers|numbers|letters|
+-------+-------+-------+
|    456|    456|    def|
|   null|   null|    zzz|
|       |       |    hhh|
|    123|    123|    abc|
+-------+-------+-------+

and %<=>% in SparkR:

numbers_df <- createDataFrame(data.frame(numbers = c("123", "456", NA, "")))
letters_df <- createDataFrame(data.frame(
  numbers = c("123", "456", NA, ""),
  letters = c("abc", "def", "zzz", "hhh")
))

head(join(numbers_df, letters_df, numbers_df$numbers %<=>% letters_df$numbers))

  numbers numbers letters
1     456     456     def
2    <NA>    <NA>     zzz
3                     hhh
4     123     123     abc

With SQL (Spark 2.2.0+) you can use IS NOT DISTINCT FROM:

SELECT * FROM numbers JOIN letters 
ON numbers.numbers IS NOT DISTINCT FROM letters.numbers

This is can be used with DataFrame API as well:

numbersDf.alias("numbers")
  .join(lettersDf.alias("letters"))
  .where("numbers.numbers IS NOT DISTINCT FROM letters.numbers")

answered Oct 02 '22 22:10

zero323

val numbers2 = numbersDf.withColumnRenamed("numbers","num1") //rename columns so that we can disambiguate them in the join
val letters2 = lettersDf.withColumnRenamed("numbers","num2")
val joinedDf = numbers2.join(letters2, $"num1" === $"num2" || ($"num1".isNull &&  $"num2".isNull) ,"outer")
joinedDf.select("num1","letters").withColumnRenamed("num1","numbers").show  //rename the columns back to the original names

answered Oct 02 '22 21:10

jasonS

Based on K L's idea, you could use foldLeft to generate join column expression:

def nullSafeJoin(rightDF: DataFrame, columns: Seq[String], joinType: String)(leftDF: DataFrame): DataFrame = 
{

  val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
  val fullExpr = columns.tail.foldLeft(colExpr) { 
    (colExpr, p) => colExpr && leftDF(p) <=> rightDF(p) 
  }

  leftDF.join(rightDF, fullExpr, joinType)
}

then, you could call this function just like:

aDF.transform(nullSafejoin(bDF, columns, joinType))

answered Oct 02 '22 21:10

timothyzhang

Complementing the other answers, for PYSPARK < 2.3.0 you would not have Column.eqNullSafe neither IS NOT DISTINCT FROM.

You still can build the <=> operator with an sql expression to include it in the join, as long as you define alias for the join queries:

from pyspark.sql.types import StringType
import pyspark.sql.functions as F

numbers_df = spark.createDataFrame (["123","456",None,""], StringType()).toDF("numbers")
letters_df = spark.createDataFrame ([("123", "abc"),("456", "def"),(None, "zzz"),("", "hhh") ]).\
    toDF("numbers", "letters")

joined_df = numbers_df.alias("numbers").join(letters_df.alias("letters"),
                                             F.expr('numbers.numbers <=> letters.numbers')).\
    select('letters.*')
joined_df.show()

+-------+-------+
|numbers|letters|
+-------+-------+
|    456|    def|
|   null|    zzz|
|       |    hhh|
|    123|    abc|
+-------+-------+

answered Oct 02 '22 21:10

Marcos Pindado

Related questions
                            
                                Passing List<> to SQL Stored Procedure
                            
                                How can I force a query to not use a index on a given table?
                            
                                How to return multiple values in one column (T-SQL)?
                            
                                When should I use primary key or index?
                            
                                What's the best way to store the days of the week an event takes place on in a relational database?
                            
                                Why do SQL databases use a write-ahead log over a command log?
                            
                                What are good ways to prevent SQL injection? [duplicate]
                            
                                Equivalent of SQL ISNULL in LINQ?
                            
                                How do I join the most recent row in one table to another table?
                            
                                Matching all values in IN clause
                            
                                How to group mysql rows with same column value into one row?
                            
                                Is it good database design to have admin users in the same table as front-end users?
                            
                                How to kill a running SELECT statement
                            
                                Difference between Parameters.Add(string, object) and Parameters.AddWithValue
                            
                                PostgreSQL - Iterate over results of query
                            
                                Postgres Error: More than one row returned by a subquery used as an expression
                            
                                Why can't I use an alias for an aggregate in a having clause?
                            
                                Invalid use side-effecting operator Insert within a function
                            
                                SQL Return only duplicate rows
                            
                                T-SQL split string based on delimiter

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Including null values in an Apache Spark Join

Tags:

sql

join

scala

apache-spark

apache-spark-sql

Powers

People also ask

4 Answers

zero323

jasonS

timothyzhang

Marcos Pindado

Recent Activity

Donate For Us