 

Union a list of PySpark DataFrames

Let's say I have a list of PySpark DataFrames: [df1, df2, ...]. I want to union them (so effectively do df1.union(df2).union(df3)...). What's the best practice to achieve that?

asked by mihagazvoda


2 Answers

You could use reduce from functools and pass the union function along with the list of DataFrames. Note that unionByName matches columns by name rather than by position, which is usually safer than union.

from functools import reduce

from pyspark.sql import DataFrame

# the DataFrames to combine
list_of_sdf = [df1, df2, ...]

# fold unionByName over the list: df1.unionByName(df2).unionByName(df3)...
final_sdf = reduce(DataFrame.unionByName, list_of_sdf)

The resulting final_sdf will contain the rows of all the DataFrames appended together.
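For example, a minimal sketch (assuming an active SparkSession and hypothetical columns id and value) could look like this:

from functools import reduce

from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample DataFrames with identical schemas
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df2 = spark.createDataFrame([(3, "c")], ["id", "value"])
df3 = spark.createDataFrame([(4, "d")], ["id", "value"])

final_sdf = reduce(DataFrame.unionByName, [df1, df2, df3])
final_sdf.show()  # 4 rows: ids 1 through 4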

answered by samkart


When some DataFrames have missing columns, use a partially applied function so unionByName can fill the missing columns with nulls:

from functools import reduce, partial
from pyspark.sql import DataFrame

# Union dataframes by name (missing columns filled with null) 
union_by_name = partial(DataFrame.unionByName, allowMissingColumns=True)
df_output = reduce(union_by_name, [df1, df2, ...])
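
As a rough usage sketch (assuming Spark 3.1+, where allowMissingColumns is available, and hypothetical columns id, name, and score):

from functools import reduce, partial

from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical DataFrames whose schemas only partially overlap
df1 = spark.createDataFrame([(1, "a")], ["id", "name"])
df2 = spark.createDataFrame([(2, 7.5)], ["id", "score"])

union_by_name = partial(DataFrame.unionByName, allowMissingColumns=True)
df_output = reduce(union_by_name, [df1, df2])

# result has columns id, name, score; the row from df1 gets a null score
# and the row from df2 gets a null name
df_output.show()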
answered by saza


