
How to perform a union on two DataFrames with different numbers of columns in Spark?

I have 2 DataFrames:

(source DataFrames shown as an image)

I need a union like this:

(expected result shown as an image)

The unionAll function doesn't work because the number and the names of the columns are different.

How can I do this?

Allan Feliph asked Sep 28 '16 21:09

People also ask

How do I merge two DataFrames with different columns in Spark?

Here the first dataframe (dataframe1) has the columns ['ID', 'NAME', 'Address'] and the second dataframe (dataframe2) has the columns ['ID', 'Age']. We have to add the Age column to the first dataframe and the NAME and Address columns to the second dataframe; we can do this with the lit() function, which is available in PySpark.
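For reference, a minimal PySpark sketch of that lit() approach (the DataFrame contents are invented for illustration):

# A minimal PySpark sketch of the lit() approach; the data values are made up.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

dataframe1 = spark.createDataFrame(
    [(1, "Alice", "NYC"), (2, "Bob", "LA")], ["ID", "NAME", "Address"])
dataframe2 = spark.createDataFrame(
    [(3, 30), (4, 25)], ["ID", "Age"])

# Add the columns each side is missing, filled with nulls
# (casting the null literal keeps the merged schema explicit)
df1_full = dataframe1.withColumn("Age", lit(None).cast("int"))
df2_full = (dataframe2
            .withColumn("NAME", lit(None).cast("string"))
            .withColumn("Address", lit(None).cast("string")))

# Align the column order, then do a positional union
cols = ["ID", "NAME", "Address", "Age"]
df1_full.select(cols).union(df2_full.select(cols)).show()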

How do you do a union of multiple DataFrames in PySpark?

The PySpark unionByName() function is also used to combine two or more data frames, and it can combine dataframes that have different schemas. This is because it matches columns by name rather than by position, as in data_frame1.unionByName(data_frame2), where data_frame1 and data_frame2 are the dataframes to combine.
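For example, a short PySpark sketch with invented sample data (unionByName() requires Spark 2.3+, and the allowMissingColumns flag requires Spark 3.1+):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data_frame1 = spark.createDataFrame([(50, 2), (34, 4)], ["age", "children"])
data_frame2 = spark.createDataFrame([(2, 26), (4, 32)], ["children", "age"])  # same columns, different order

# Columns are matched by name, so the differing order does not matter
data_frame1.unionByName(data_frame2).show()

# On Spark 3.1+, columns missing from one side can be filled with nulls
data_frame3 = spark.createDataFrame([(26, 60000.0)], ["age", "income"])
data_frame1.unionByName(data_frame3, allowMissingColumns=True).show()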

How do you Union two DataFrames in pandas with different column names?

Different column names are specified for merges in pandas using the "left_on" and "right_on" parameters instead of the single "on" parameter; these arguments to the pandas merge function name the joining column on each side.
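As an illustration (column names and data invented), a pandas merge where the joining column has a different name on each side:

import pandas as pd

left = pd.DataFrame({"emp_id": [1, 2], "name": ["Alice", "Bob"]})
right = pd.DataFrame({"employee_id": [1, 2], "salary": [60000, 35000]})

# left_on / right_on name the join column on each side when the names differ
merged = pd.merge(left, right, left_on="emp_id", right_on="employee_id")
print(merged)

# For a row-wise union of frames with different columns, concat aligns by
# column name and fills the gaps with NaN
stacked = pd.concat([left, right], ignore_index=True)
print(stacked)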


1 Answer

In Scala you just have to append all missing columns as nulls.

import org.apache.spark.sql.functions._

// let df1 and df2 be the DataFrames to merge
val df1 = sc.parallelize(List(
  (50, 2),
  (34, 4)
)).toDF("age", "children")

val df2 = sc.parallelize(List(
  (26, true, 60000.00),
  (32, false, 35000.00)
)).toDF("age", "education", "income")

val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union

def expr(myCols: Set[String], allCols: Set[String]) = {
  allCols.toList.map(x => x match {
    case x if myCols.contains(x) => col(x)
    case _ => lit(null).as(x)
  })
}

df1.select(expr(cols1, total):_*).unionAll(df2.select(expr(cols2, total):_*)).show()

+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 50|       2|     null|   null|
| 34|       4|     null|   null|
| 26|    null|     true|60000.0|
| 32|    null|    false|35000.0|
+---+--------+---------+-------+

Update

Both intermediate DataFrames will have the same order of columns, because we map through total in both cases.

df1.select(expr(cols1, total):_*).show()
df2.select(expr(cols2, total):_*).show()

+---+--------+---------+------+
|age|children|education|income|
+---+--------+---------+------+
| 50|       2|     null|  null|
| 34|       4|     null|  null|
+---+--------+---------+------+

+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 26|    null|     true|60000.0|
| 32|    null|    false|35000.0|
+---+--------+---------+-------+
Alberto Bonsanto answered Sep 20 '22 14:09