Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dataframe Join Null-Safe Condition Use

I have two dataframes with null values that I'm trying to join using PySpark 2.3.0:

dfA:

# +----+----+
# |col1|col2|
# +----+----+
# |   a|null|
# |   b|   0|
# |   c|   0|
# +----+----+

dfB:

# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|   x|
# |   b|   0|   x|
# +----+----+----+

The dataframes are creatable with this script:

dfA = spark.createDataFrame(
    [
        ('a', None),
        ('b', '0'),
        ('c', '0')
    ],
    ('col1', 'col2')
)

dfB = spark.createDataFrame(
    [
        ('a', None, 'x'),
        ('b', '0', 'x')
    ],
    ('col1', 'col2', 'col3')
)

Join call:

dfA.join(dfB, dfB.columns[:2], how='left').orderBy('col1').show()

Result:

# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|null|  <- col3 should be x
# |   b|   0|   x|
# |   c|   0|null|
# +----+----+----+

Expected result:

# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|   x|  <-
# |   b|   0|   x|
# |   c|   0|null|
# +----+----+----+

It works if I set the first row, col2 to anything other than null, but I need to support null values.

I tried using a condition to compare using null-safe equals as outlined in this post like so:

cond = (dfA.col1.eqNullSafe(dfB.col1) & dfA.col2.eqNullSafe(dfB.col2))
dfA.join(dfB, cond, how='left').orderBy(dfA.col1).show()

Result of null-safe join:

# +----+----+----+----+----+
# |col1|col2|col1|col2|col3|
# +----+----+----+----+----+
# |   a|null|   a|null|   x|
# |   b|   0|   b|   0|   x|
# |   c|   0|null|null|null|
# +----+----+----+----+----+

This retains duplicate columns though, I'm still looking for a way to achieve the expected result at the end of a join.

like image 344
Tibberzz Avatar asked Jun 13 '18 22:06

Tibberzz


1 Answers

A simple solution would be to select the columns that you want to keep. This will let you specify which source dataframe they should come from as well as avoid the duplicate column issue.

dfA.join(dfB, cond, how='left').select(dfA.col1, dfA.col2, dfB.col3).orderBy('col1').show()
like image 159
Shaido Avatar answered Sep 29 '22 13:09

Shaido