 

PySpark: Handling NULL in Joins

I am trying to join two DataFrames in PySpark. My problem is that I want the inner join to match rows even when the join columns contain NULLs. I can see that Scala has the <=> operator for this, but <=> does not work in PySpark.

from pyspark.sql import Row

userLeft = sc.parallelize([
    Row(id=u'1',
        first_name=u'Steve',
        last_name=u'Kent',
        email=u'[email protected]'),
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'[email protected]'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'[email protected]')]).toDF()

userRight = sc.parallelize([
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'[email protected]'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'[email protected]')]).toDF()

Current working version:

userLeft.join(userRight, (userLeft.last_name==userRight.last_name) & (userLeft.first_name==userRight.first_name)).show()

Current Result:

+--------------------+----------+---+---------+--------------------+----------+---+---------+
|               email|first_name| id|last_name|               email|first_name| id|last_name|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
|marge.peace@email...|  Margaret|  2|    Peace|marge.peace@email...|  Margaret|  2|    Peace|
+--------------------+----------+---+---------+--------------------+----------+---+---------+

Expected Result:

+--------------------+----------+---+---------+--------------------+----------+---+---------+
|               email|first_name| id|last_name|               email|first_name| id|last_name|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
|  [email protected]|      null|  3|       hh|  [email protected]|      null|  3|       hh|
|marge.peace@email...|  Margaret|  2|    Peace|marge.peace@email...|  Margaret|  2|    Peace|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
Asked Sep 05 '17 by orNehPraka

People also ask

Does PySpark join on null values?

Apache Spark does not treat null values as equal when performing a join operation. If you join tables on columns that contain null values, the rows with null join keys will not be matched and will not appear in the resulting joined table.

How do you handle null in PySpark?

In PySpark, you can filter rows with NULL values using the filter() or where() functions of DataFrame together with the isNull() method of the PySpark Column class. Both return the rows that have null in the checked column as a new DataFrame.
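
For example, a quick sketch using the userLeft DataFrame defined in the question (the column name first_name comes from that DataFrame):

from pyspark.sql.functions import col

# Rows where first_name is NULL (the id=3 row in userLeft)
userLeft.filter(col("first_name").isNull()).show()

# Rows where first_name is NOT NULL
userLeft.where(col("first_name").isNotNull()).show()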

What is null safe join?

The NULL-safe equal operator (<=>) performs an equality comparison like the = operator, but returns 1 rather than NULL when both operands are NULL, and 0 rather than NULL when only one operand is NULL. In a join condition, a <=> b is equivalent to a = b OR (a IS NULL AND b IS NULL).
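
To illustrate, a minimal sketch in PySpark, assuming an existing SparkSession named spark (the column names a and b are placeholders):

from pyspark.sql import Row
import pyspark.sql.functions as F

df = spark.createDataFrame([Row(a=1, b=1), Row(a=1, b=None), Row(a=None, b=None)])

# <=> evaluates to true when both sides are NULL and to false when only one side is NULL
df.select("a", "b", F.expr("a <=> b").alias("null_safe_eq")).show()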

How does Spark handle null records?

Spark rules for dealing with null: Scala code should return None (or null) for values that are unknown, missing, or irrelevant. DataFrames should also use null for values that are unknown, missing, or irrelevant. Use Option in Scala code and fall back on null if Option becomes a performance bottleneck.


1 Answer

For PySpark < 2.3.0 you can still build the <=> operator with an expression column like this:

import pyspark.sql.functions as F

# Null-safe join condition built from a SQL expression string
df1.alias("df1").join(df2.alias("df2"), on=F.expr('df1.column <=> df2.column'))
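
Applied to the DataFrames from the question, a sketch of the same idea (the aliases l and r are just illustrative names):

import pyspark.sql.functions as F

# Null-safe inner join on both name columns; the id=3 rows now match even
# though first_name is NULL on both sides
userLeft.alias("l").join(
    userRight.alias("r"),
    on=F.expr("l.first_name <=> r.first_name AND l.last_name <=> r.last_name")
).show()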

For PySpark >= 2.3.0, you can use Column.eqNullSafe or IS NOT DISTINCT FROM, as answered here.
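
For reference, a sketch of the eqNullSafe variant against the question's DataFrames (requires Spark 2.3.0 or later):

# Column.eqNullSafe is the DataFrame-API equivalent of <=>
userLeft.join(
    userRight,
    userLeft.first_name.eqNullSafe(userRight.first_name) &
    userLeft.last_name.eqNullSafe(userRight.last_name)
).show()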

Answered Sep 17 '22 by Marcos Pindado