 

PySpark: Handling NULL in Joins

I am trying to join two DataFrames in PySpark. My problem is that I want the inner join to match rows even when the join columns contain NULLs. I can see that Scala has the <=> operator for this, but <=> does not work in PySpark.

from pyspark.sql import Row

userLeft = sc.parallelize([
    Row(id=u'1',
        first_name=u'Steve',
        last_name=u'Kent',
        email=u'[email protected]'),
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'[email protected]'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'[email protected]')]).toDF()

userRight = sc.parallelize([
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'[email protected]'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'[email protected]')]).toDF()

Current working version:

userLeft.join(userRight, (userLeft.last_name==userRight.last_name) & (userLeft.first_name==userRight.first_name)).show()

Current Result:

+--------------------+----------+---+---------+--------------------+----------+---+---------+
|               email|first_name| id|last_name|               email|first_name| id|last_name|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
|marge.peace@email...|  Margaret|  2|    Peace|marge.peace@email...|  Margaret|  2|    Peace|
+--------------------+----------+---+---------+--------------------+----------+---+---------+

Expected Result:

+--------------------+----------+---+---------+--------------------+----------+---+---------+
|               email|first_name| id|last_name|               email|first_name| id|last_name|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
|  [email protected]|      null|  3|       hh|  [email protected]|      null|  3|       hh|
|marge.peace@email...|  Margaret|  2|    Peace|marge.peace@email...|  Margaret|  2|    Peace|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
Asked Sep 05 '17 by orNehPraka

People also ask

Does PySpark join on null values?

Apache Spark does not treat null values as equal when performing a join operation. If you join tables on columns that contain null values, the rows with null join keys will not be matched and will not appear in the resulting joined table.

How do you handle null in PySpark?

In PySpark, you can filter rows with NULL values using the filter() or where() functions of DataFrame together with the isNull() method of the PySpark Column class. Both return the rows that have null in the checked column as a new DataFrame.
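
For example, a quick sketch using the userLeft DataFrame defined in the question (the column name first_name comes from that DataFrame):

from pyspark.sql.functions import col

# Rows where first_name is NULL (the id=3 row in userLeft)
userLeft.filter(col("first_name").isNull()).show()

# Rows where first_name is NOT NULL
userLeft.where(col("first_name").isNotNull()).show()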

What is null safe join?

The NULL-safe equal operator (<=>) performs an equality comparison like the = operator, but returns 1 rather than NULL when both operands are NULL, and 0 rather than NULL when only one operand is NULL. In a join condition, a <=> b is equivalent to a = b OR (a IS NULL AND b IS NULL).
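
To illustrate, a minimal sketch in PySpark, assuming an existing SparkSession named spark (the column names a and b are placeholders):

from pyspark.sql import Row
import pyspark.sql.functions as F

df = spark.createDataFrame([Row(a=1, b=1), Row(a=1, b=None), Row(a=None, b=None)])

# <=> evaluates to true when both sides are NULL and to false when only one side is NULL
df.select("a", "b", F.expr("a <=> b").alias("null_safe_eq")).show()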

How does Spark handle null records?

Spark rules for dealing with null: Scala code should return None (or null) for values that are unknown, missing, or irrelevant. DataFrames should also use null for values that are unknown, missing, or irrelevant. Use Option in Scala code and fall back on null if Option becomes a performance bottleneck.


1 Answer

For PySpark < 2.3.0 you can still build the <=> operator with an expression column like this:

import pyspark.sql.functions as F

# Null-safe join condition built from a SQL expression string
df1.alias("df1").join(df2.alias("df2"), on=F.expr('df1.column <=> df2.column'))
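
Applied to the DataFrames from the question, a sketch of the same idea (the aliases l and r are just illustrative names):

import pyspark.sql.functions as F

# Null-safe inner join on both name columns; the id=3 rows now match even
# though first_name is NULL on both sides
userLeft.alias("l").join(
    userRight.alias("r"),
    on=F.expr("l.first_name <=> r.first_name AND l.last_name <=> r.last_name")
).show()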

For PySpark >= 2.3.0, you can use Column.eqNullSafe or IS NOT DISTINCT FROM, as answered here.
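
For reference, a sketch of the eqNullSafe variant against the question's DataFrames (requires Spark 2.3.0 or later):

# Column.eqNullSafe is the DataFrame-API equivalent of <=>
userLeft.join(
    userRight,
    userLeft.first_name.eqNullSafe(userRight.first_name) &
    userLeft.last_name.eqNullSafe(userRight.last_name)
).show()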

Answered Sep 17 '22 by Marcos Pindado