Spark union column order

I recently came across something strange in Spark. As far as I understand, given the column-based storage of Spark DataFrames, the order of the columns doesn't really have any meaning; they're like keys in a dictionary.

During a df1.union(df2), does the order of the columns matter? I would have assumed that it shouldn't, but according to the wisdom of SQL forums it does.

So we have two DataFrames:

df1
+---+----+
|  a|   b|
+---+----+
|  1| asd|
|  2|asda|
|  3| f1f|
+---+----+

df2
+----+---+
|   b|  a|
+----+---+
| asd|  1|
|asda|  2|
| f1f|  3|
+----+---+

result
+----+----+
|   a|   b|
+----+----+
|   1| asd|
|   2|asda|
|   3| f1f|
| asd|   1|
|asda|   2|
| f1f|   3|
+----+----+

It looks like the schema from df1 was used, but the rows appear to have been appended following the column order of each original DataFrame. Obviously the solution would be to do df1.union(df2.select(df1.columns)).

But the main question is, why does it do this? Is it simply because it's part of pyspark.sql, or is there some underlying data architecture in Spark that I've goofed up in understanding?

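Code to create the test set, if anyone wants to try:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same data, but the columns are declared in a different order
d1 = {'a': [1, 2, 3], 'b': ['asd', 'asda', 'f1f']}
d2 = {'b': ['asd', 'asda', 'f1f'], 'a': [1, 2, 3]}
pdf1 = pd.DataFrame(d1)
pdf2 = pd.DataFrame(d2)
df1 = spark.createDataFrame(pdf1)
df2 = spark.createDataFrame(pdf2)
test = df1.union(df2)  # columns are matched by position, not by name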

asked Jul 08 '19 20:07

Charles Du

2 Answers

The Spark union is implemented according to standard SQL and therefore resolves the columns by position. This is also stated by the API documentation:

Return a new DataFrame containing union of rows in this and another frame.

This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.

Also as standard in SQL, this function resolves columns by position (not by name).

Since Spark 2.3 you can use unionByName to union two DataFrames with the columns resolved by name instead of by position.

answered Oct 23 '22 23:10

cronoik

In Spark, union is not done on the column metadata (the names), and the data is not rearranged to match them as you might expect. Instead, union is done on the column positions: if you are unioning two DataFrames, both must have the same number of columns, and you have to take the positions of your columns into account before doing the union. This is the same positional behavior that UNION has in SQL on Oracle and other RDBMSs. Hope that answers your question.
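To make the positional rule concrete, here is a toy model in plain Python. It is not Spark's implementation; `union_by_position` and `union_by_name` are hypothetical helpers written for this illustration, and a "frame" is simply a list of column names plus a list of row tuples.

```python
def union_by_position(cols1, rows1, cols2, rows2):
    """Toy model: keep the first frame's names and append the rows unchanged."""
    if len(cols1) != len(cols2):
        raise ValueError("both inputs need the same number of columns")
    return cols1, rows1 + rows2


def union_by_name(cols1, rows1, cols2, rows2):
    """Toy model: reorder the second frame's values to the first frame's names."""
    if sorted(cols1) != sorted(cols2):
        raise ValueError("both inputs need the same column names")
    index = [cols2.index(c) for c in cols1]
    reordered = [tuple(row[i] for i in index) for row in rows2]
    return cols1, rows1 + reordered


# The data from the question, as plain Python values
cols1, rows1 = ["a", "b"], [(1, "asd"), (2, "asda"), (3, "f1f")]
cols2, rows2 = ["b", "a"], [("asd", 1), ("asda", 2), ("f1f", 3)]

pos_cols, pos_rows = union_by_position(cols1, rows1, cols2, rows2)
name_cols, name_rows = union_by_name(cols1, rows1, cols2, rows2)

print(pos_cols, pos_rows[3])    # first row that came from the second frame
print(name_cols, name_rows[3])
```

The positional version never looks at the second frame's names, which is why the question's result shows 'asd' under column a.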

answered Oct 24 '22 00:10

Aaron

Tags: apache-spark, apache-spark-sql, pyspark, pyspark-sql
