
How to perform a union on two DataFrames with different numbers of columns in Spark?

I have 2 DataFrames:

(source DataFrames shown as an image)

I need a union like this:

(expected result shown as an image)

The unionAll function doesn't work because the number and the names of the columns are different.

How can I do this?

Allan Feliph asked Sep 28 '16 21:09

People also ask

How do I merge two DataFrames with different columns in Spark?

Here the first dataframe (dataframe1) has the columns ['ID', 'NAME', 'Address'] and the second dataframe (dataframe2) has the columns ['ID', 'Age']. We have to add the Age column to the first dataframe and the NAME and Address columns to the second dataframe; we can do this with the lit() function, which is available in PySpark.
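For reference, a minimal PySpark sketch of that lit() approach (the DataFrame contents are invented for illustration):

# A minimal PySpark sketch of the lit() approach; the data values are made up.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

dataframe1 = spark.createDataFrame(
    [(1, "Alice", "NYC"), (2, "Bob", "LA")], ["ID", "NAME", "Address"])
dataframe2 = spark.createDataFrame(
    [(3, 30), (4, 25)], ["ID", "Age"])

# Add the columns each side is missing, filled with nulls
# (casting the null literal keeps the merged schema explicit)
df1_full = dataframe1.withColumn("Age", lit(None).cast("int"))
df2_full = (dataframe2
            .withColumn("NAME", lit(None).cast("string"))
            .withColumn("Address", lit(None).cast("string")))

# Align the column order, then do a positional union
cols = ["ID", "NAME", "Address", "Age"]
df1_full.select(cols).union(df2_full.select(cols)).show()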

How do you do a union of multiple DataFrames in PySpark?

The PySpark unionByName() function is also used to combine two or more data frames, and it can combine dataframes that have different schemas. This is because it matches columns by name rather than by position, as in data_frame1.unionByName(data_frame2), where data_frame1 and data_frame2 are the dataframes to combine.
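For example, a short PySpark sketch with invented sample data (unionByName() requires Spark 2.3+, and the allowMissingColumns flag requires Spark 3.1+):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data_frame1 = spark.createDataFrame([(50, 2), (34, 4)], ["age", "children"])
data_frame2 = spark.createDataFrame([(2, 26), (4, 32)], ["children", "age"])  # same columns, different order

# Columns are matched by name, so the differing order does not matter
data_frame1.unionByName(data_frame2).show()

# On Spark 3.1+, columns missing from one side can be filled with nulls
data_frame3 = spark.createDataFrame([(26, 60000.0)], ["age", "income"])
data_frame1.unionByName(data_frame3, allowMissingColumns=True).show()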

How do you Union two DataFrames in pandas with different column names?

Different column names are specified for merges in pandas using the "left_on" and "right_on" parameters instead of the single "on" parameter; these arguments to the pandas merge function name the joining column on each side.
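As an illustration (column names and data invented), a pandas merge where the joining column has a different name on each side:

import pandas as pd

left = pd.DataFrame({"emp_id": [1, 2], "name": ["Alice", "Bob"]})
right = pd.DataFrame({"employee_id": [1, 2], "salary": [60000, 35000]})

# left_on / right_on name the join column on each side when the names differ
merged = pd.merge(left, right, left_on="emp_id", right_on="employee_id")
print(merged)

# For a row-wise union of frames with different columns, concat aligns by
# column name and fills the gaps with NaN
stacked = pd.concat([left, right], ignore_index=True)
print(stacked)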


1 Answer

In Scala you just have to append all missing columns as nulls.

import org.apache.spark.sql.functions._

// let df1 and df2 be the DataFrames to merge
val df1 = sc.parallelize(List(
  (50, 2),
  (34, 4)
)).toDF("age", "children")

val df2 = sc.parallelize(List(
  (26, true, 60000.00),
  (32, false, 35000.00)
)).toDF("age", "education", "income")

val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union

def expr(myCols: Set[String], allCols: Set[String]) = {
  allCols.toList.map(x => x match {
    case x if myCols.contains(x) => col(x)
    case _ => lit(null).as(x)
  })
}

df1.select(expr(cols1, total):_*).unionAll(df2.select(expr(cols2, total):_*)).show()

+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 50|       2|     null|   null|
| 34|       4|     null|   null|
| 26|    null|     true|60000.0|
| 32|    null|    false|35000.0|
+---+--------+---------+-------+

Update

Both intermediate DataFrames will have the same order of columns, because we map through total in both cases.

df1.select(expr(cols1, total):_*).show()
df2.select(expr(cols2, total):_*).show()

+---+--------+---------+------+
|age|children|education|income|
+---+--------+---------+------+
| 50|       2|     null|  null|
| 34|       4|     null|  null|
+---+--------+---------+------+

+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 26|    null|     true|60000.0|
| 32|    null|    false|35000.0|
+---+--------+---------+-------+
Alberto Bonsanto answered Sep 20 '22 14:09