I have two dataframes with null values that I'm trying to join using PySpark 2.3.0:
dfA:
# +----+----+
# |col1|col2|
# +----+----+
# |   a|null|
# |   b|   0|
# |   c|   0|
# +----+----+
dfB:
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|   x|
# |   b|   0|   x|
# +----+----+----+
The dataframes can be created with this script:
dfA = spark.createDataFrame(
    [
        ('a', None),
        ('b', '0'),
        ('c', '0')
    ],
    ('col1', 'col2')
)
dfB = spark.createDataFrame(
    [
        ('a', None, 'x'),
        ('b', '0', 'x')
    ],
    ('col1', 'col2', 'col3')
)
Join call:
dfA.join(dfB, dfB.columns[:2], how='left').orderBy('col1').show()
Result:
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|null|  <- col3 should be x
# |   b|   0|   x|
# |   c|   0|null|
# +----+----+----+
Expected result:
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|   x|  <-
# |   b|   0|   x|
# |   c|   0|null|
# +----+----+----+
It works if I set col2 in the first row to anything other than null, but I need to support null values.
I tried comparing with a null-safe equality condition, as outlined in this post, like so:
cond = (dfA.col1.eqNullSafe(dfB.col1) & dfA.col2.eqNullSafe(dfB.col2))
dfA.join(dfB, cond, how='left').orderBy(dfA.col1).show()
Result of null-safe join:
# +----+----+----+----+----+
# |col1|col2|col1|col2|col3|
# +----+----+----+----+----+
# |   a|null|   a|null|   x|
# |   b|   0|   b|   0|   x|
# |   c|   0|null|null|null|
# +----+----+----+----+----+
This retains the duplicate join columns, though. I'm still looking for a way to get the expected result at the end of the join.
Your null-safe condition is the right approach: an ordinary equality join never matches null against null, which is why the first row lost its col3 value. A simple solution is to select only the columns you want to keep after the join. This lets you specify which source dataframe each column should come from and avoids the duplicate column issue.
dfA.join(dfB, cond, how='left').select(dfA.col1, dfA.col2, dfB.col3).orderBy('col1').show()
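Ordered by col1, that should give exactly the expected result shown above. If dfA had many more columns to carry through, a slightly different sketch (assuming the same cond as above) is to drop dfB's copies of the join keys instead of enumerating the columns to keep; drop accepts a Column reference, so the duplicate columns can be removed unambiguously:
# Null-safe left join, then drop dfB's duplicate join columns
# rather than selecting every column from dfA explicitly.
joined = (
    dfA.join(dfB, cond, how='left')
       .drop(dfB.col1)
       .drop(dfB.col2)
)
joined.orderBy('col1').show()
Either way, only dfA's col1 and col2 survive, alongside dfB's col3.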