I am trying to join 2 DataFrames in PySpark. My problem is that I want the inner join to match rows even when the join columns contain NULLs. I can see that Scala has the null-safe <=> operator as an alternative, but <=> is not working for me in PySpark.
from pyspark.sql import Row

userLeft = sc.parallelize([
    Row(id=u'1',
        first_name=u'Steve',
        last_name=u'Kent',
        email=u'[email protected]'),
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'[email protected]'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'[email protected]')]).toDF()
userRight = sc.parallelize([
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'[email protected]'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'[email protected]')]).toDF()
Current working version:
userLeft.join(userRight, (userLeft.last_name==userRight.last_name) & (userLeft.first_name==userRight.first_name)).show()
Current Result:
+--------------------+----------+---+---------+--------------------+----------+---+---------+
| email|first_name| id|last_name| email|first_name| id|last_name|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
|marge.peace@email...| Margaret| 2| Peace|marge.peace@email...| Margaret| 2| Peace|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
Expected Result:
+--------------------+----------+---+---------+--------------------+----------+---+---------+
| email|first_name| id|last_name| email|first_name| id|last_name|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
| [email protected]| null| 3| hh| [email protected]| null| 3| hh|
|marge.peace@email...| Margaret| 2| Peace|marge.peace@email...| Margaret| 2| Peace|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
Apache Spark does not match NULL values in a join condition: comparing NULL to anything with = yields NULL, which is treated as false, so rows whose join columns are NULL are excluded from the result of an inner join.
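A quick way to see why (a minimal sketch, assuming an active SparkSession named spark):

spark.sql("SELECT NULL = NULL AS plain_eq").show()
# plain_eq is NULL, not true, so a join condition built with = can never
# match two NULL values, and those rows are dropped from an inner join.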
<=> is the NULL-safe equal operator. It performs an equality comparison like the = operator, but returns true rather than NULL if both operands are NULL, and false rather than NULL if exactly one operand is NULL. Inside a filter or join condition, a <=> b behaves like a = b OR (a IS NULL AND b IS NULL); in Spark SQL it can also be written as a IS NOT DISTINCT FROM b.
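To check the semantics directly (a sketch; the pairs DataFrame and its column names are made up for illustration and assume the same spark session):

import pyspark.sql.functions as F

pairs = spark.createDataFrame([(1, 1), (1, None), (None, None)], ["a", "b"])
pairs.select(
    "a", "b",
    (F.col("a") == F.col("b")).alias("eq"),
    F.expr("a <=> b").alias("null_safe_eq")
).show()
# eq is NULL whenever a or b is NULL; null_safe_eq is false for (1, NULL)
# and true for (NULL, NULL).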
Spark rules for dealing with null: Scala code should return None (or null) for values that are unknown, missing, or irrelevant. DataFrames should also use null for values that are unknown, missing, or irrelevant. Use Option in Scala code and fall back on null if Option becomes a performance bottleneck.
For PySpark < 2.3.0 you can still build the <=> operator with an expression column like this:
import pyspark.sql.functions as F
df1.alias("df1").join(df2.alias("df2"), on = F.expr('df1.column <=> df2.column'))
For PySpark >= 2.3.0, you can use Column.eqNullSafe, or IS NOT DISTINCT FROM in a SQL expression.
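With eqNullSafe, the same join on the question's data could look like this (a sketch, assuming Spark >= 2.3.0):

userLeft.join(
    userRight,
    userLeft.first_name.eqNullSafe(userRight.first_name) &
    userLeft.last_name.eqNullSafe(userRight.last_name)
).show()
# Both the Margaret/Peace row and the NULL/hh row appear in the output,
# matching the expected result above.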