
How to merge two rows in Spark SQL?

I need to merge rows in the same DataFrame based on a key column, "id". In the sample DataFrame, one row has data for id, name, and age; the other row has id, name, and city. Rows with the same "id" have to be merged into a single record in the final DataFrame. If an id has just one record, it should still appear, with null in the missing column (Smith and Jake in the example below).

The computation needs to run on real-time data, so a solution based on Spark native functions would be ideal. I have tried filtering the records on the age and city columns into separate DataFrames and then performing a left join on id, but it is not very efficient. Looking for any alternative suggestions. Thanks in advance!

Sample Dataframe

// toDF needs import spark.implicits._ (already in scope in spark-shell)
val inputDF = Seq(
  ("100", "John", Some(35), None),
  ("100", "John", None, Some("Georgia")),
  ("101", "Mike", Some(25), None),
  ("101", "Mike", None, Some("New York")),
  ("103", "Mary", Some(22), None),
  ("103", "Mary", None, Some("Texas")),
  ("104", "Smith", Some(25), None),
  ("105", "Jake", None, Some("Florida"))
).toDF("id", "name", "age", "city")

Input Dataframe

+---+-----+----+--------+
|id |name |age |city    |
+---+-----+----+--------+
|100|John |35  |null    |
|100|John |null|Georgia |
|101|Mike |25  |null    |
|101|Mike |null|New York|
|103|Mary |22  |null    |
|103|Mary |null|Texas   |
|104|Smith|25  |null    |
|105|Jake |null|Florida |
+---+-----+----+--------+ 

Expected Output Dataframe

+---+-----+----+---------+
| id| name| age|     city|
+---+-----+----+---------+
|100| John|  35|  Georgia|
|101| Mike|  25| New York|
|103| Mary|  22|    Texas|
|104|Smith|  25|     null|
|105| Jake|null|  Florida|
+---+-----+----+---------+
RmDmachine asked Oct 23 '25 15:10


1 Answer

Use the first or last standard function with the ignoreNulls flag on, so that each aggregate returns a non-null value from the group whenever one exists.

first standard function

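import org.apache.spark.sql.functions.first

// Group on the key columns and keep the first non-null value of each remaining column.
val q = inputDF
  .groupBy("id", "name")
  .agg(first("age", ignoreNulls = true) as "age", first("city", ignoreNulls = true) as "city")
  .orderBy("id")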
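
Since the question asks about Spark SQL, the same aggregation can also be written as a SQL query, where first takes an optional second argument that plays the role of ignoreNulls. The following is only a sketch under some assumptions: it starts its own local SparkSession, and the object name MergeRowsSql, the helper names, and the temp view name "people" are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MergeRowsSql {
  // Sample rows from the question; the explicit element type keeps age and city nullable.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(String, String, Option[Int], Option[String])](
      ("100", "John", Some(35), None),
      ("100", "John", None, Some("Georgia")),
      ("101", "Mike", Some(25), None),
      ("101", "Mike", None, Some("New York")),
      ("103", "Mary", Some(22), None),
      ("103", "Mary", None, Some("Texas")),
      ("104", "Smith", Some(25), None),
      ("105", "Jake", None, Some("Florida"))
    ).toDF("id", "name", "age", "city")
  }

  // Registers the input as a temp view ("people" is a hypothetical name)
  // and merges the rows of each id with first(expr, true), i.e. first ignoring nulls.
  def mergeWithSql(spark: SparkSession, input: DataFrame): DataFrame = {
    input.createOrReplaceTempView("people")
    spark.sql(
      """SELECT id, name,
        |       first(age, true)  AS age,
        |       first(city, true) AS city
        |FROM people
        |GROUP BY id, name
        |ORDER BY id""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("merge-rows-sql")
      .getOrCreate()
    mergeWithSql(spark, sample(spark)).show()
    spark.stop()
  }
}
```

For the sample data this should yield one row per id, matching the expected output in the question.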

last standard function

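import org.apache.spark.sql.functions.last

// ignoreNulls has to be set for both columns; without it, last("city") can return null.
val q = inputDF
  .groupBy("id", "name")
  .agg(last("age", ignoreNulls = true) as "age", last("city", ignoreNulls = true) as "city")
  .orderBy("id")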
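
One caveat: the value that first and last return depends on row order, which Spark does not guarantee after a shuffle. For the sample data that makes no difference, because each id has at most one non-null age and one non-null city. If a group could hold several different non-null values and a reproducible pick is wanted, an order-independent aggregate such as max, which also skips nulls, is one possible substitute. This is a swapped-in alternative rather than part of the approach above, and the sketch below assumes a local SparkSession; the object and helper names are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.max

object MergeRowsMax {
  // Sample rows from the question; the explicit element type keeps age and city nullable.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(String, String, Option[Int], Option[String])](
      ("100", "John", Some(35), None),
      ("100", "John", None, Some("Georgia")),
      ("101", "Mike", Some(25), None),
      ("101", "Mike", None, Some("New York")),
      ("103", "Mary", Some(22), None),
      ("103", "Mary", None, Some("Texas")),
      ("104", "Smith", Some(25), None),
      ("105", "Jake", None, Some("Florida"))
    ).toDF("id", "name", "age", "city")
  }

  // max skips nulls and does not depend on row order, so the result is
  // the same on every run; a group with only nulls still yields null.
  def mergeById(input: DataFrame): DataFrame =
    input
      .groupBy("id", "name")
      .agg(max("age") as "age", max("city") as "city")
      .orderBy("id")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("merge-rows-max")
      .getOrCreate()
    mergeById(sample(spark)).show()
    spark.stop()
  }
}
```

Note that max changes the meaning when a group has more than one non-null value: it returns the largest one (lexicographically largest for strings), not the first or last one seen.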
notNull answered Oct 25 '25 06:10