I have two DataFrames. For simplicity, I'll present them as follows:
DataFrame1
id | name
-----------
0 | Mike
1 | James
DataFrame2
id | name | salary
-------------------
0 | M | 10
1 | J | 20
2 | K | 30
I want to join the two DataFrames on id and keep only the name column from DataFrame1, falling back to the original name from DataFrame2 when an id has no match in DataFrame1.
It should be:
id | name | salary
--------------------
0 | Mike | 10
1 | James | 20
2 | K | 30
So far, I only know how to join two DataFrames like this:
df1.join(df2, df1("id") === df2("id"), "left")
  .select(df2("id"), df1("name"), df2("salary"))
But this leaves null in place of the name value "K" instead of keeping it.
Thanks!
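For reference, the two DataFrames above can be reproduced with a minimal sketch like this (assuming a spark-shell session, where spark and its implicits are available):
// Assumption: running in spark-shell, so `spark` and its implicits exist
import spark.implicits._

val df1 = Seq((0, "Mike"), (1, "James")).toDF("id", "name")
val df2 = Seq((0, "M", 10), (1, "J", 20), (2, "K", 30)).toDF("id", "name", "salary")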
You can use coalesce, which returns the first non-null column among the given columns. Also, when using a left join you should join df1 onto df2 and not the other way around: a left join keeps every row of its left side, and the row with id 2 exists only in df2:
import org.apache.spark.sql.functions._

df2.join(df1, df1("id") === df2("id"), "left")
  // take df1's name when the id matched, otherwise fall back to df2's name
  .select(df2("id"), coalesce(df1("name"), df2("name")).as("name"), df2("salary"))
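With the sample data this produces exactly the expected table from the question. The .as("name") alias is optional; without it, Spark auto-generates a column name from the coalesce expression, which is awkward to reference later.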
To replace the null values you can use DataFrameNaFunctions, like below:
df2.join(df1, df1("id") === df2("id"), "left_outer")
  .select(df2("id"), df1("name"), df2("salary"))
  // fill nulls in the selected "name" column with a placeholder
  .na.fill(Map("name" -> "unknown"))
  .show()
Here 'unknown' is just a sample value; you can replace it with whatever value you want.
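As a side note, DataFrameNaFunctions also has simpler fill overloads; a minimal sketch (withName is just an illustrative name for the joined result above):
val withName = df2.join(df1, df1("id") === df2("id"), "left_outer")
  .select(df2("id"), df1("name"), df2("salary"))

// Fill only the "name" column, via the column-restricted overload
withName.na.fill("unknown", Seq("name")).show()

// Or fill every null string column with the same placeholder
withName.na.fill("unknown").show()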
If you don't want the rows with null values at all, you can filter them out instead:
val joined = df2.join(df1, df1("id") === df2("id"), "left_outer")
  .select(df2("id"), df1("name"), df2("salary"))

// note: `final` is a reserved word in Scala, so use another name for the result
val result = joined.where(joined.col("name").isNotNull)
result.show()
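Equivalently, na.drop can target specific columns, which reads a bit more declaratively:
// Drop rows where "name" is null (same effect as the isNotNull filter above)
joined.na.drop(Seq("name")).show()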
Also note that, as mentioned in @Tzach Zohar's answer, the coalesce function
def coalesce(e: Column*)
returns the first column that is not null, or null if all inputs are null. If that is the behavior you are looking for, you can go ahead with it.
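To see that behavior in isolation, here is a quick sketch (again assuming spark-shell with implicits; demo and the column names are illustrative):
import org.apache.spark.sql.functions.coalesce
import spark.implicits._

// The second row has a null in "primary", so coalesce falls back to "fallback"
val demo = Seq((Some("a"), "x"), (None, "y")).toDF("primary", "fallback")
demo.select(coalesce($"primary", $"fallback").as("first_non_null")).show()
// expected to print "a" for the first row and "y" for the second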