Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark union column order

I've come across something strange recently in Spark. As far as I understand, given the column based storage method of spark dfs, the order of the columns really don't have any meaning, they're like keys in a dictionary.

During a df.union(df2), does the order of the columns matter? I would've assumed that it shouldn't, but according to the wisdom of sql forums it does.

So we have df1

df1
|  a|   b|
+---+----+
|  1| asd|
|  2|asda|
|  3| f1f|
+---+----+

df2
|   b|  a|
+----+---+
| asd|  1|
|asda|  2|
| f1f|  3|
+----+---+

result
|   a|   b|
+----+----+
|   1| asd|
|   2|asda|
|   3| f1f|
| asd|   1|
|asda|   2|
| f1f|   3|
+----+----+

It looks like the schema from df1 was used, but the data appears to have joined following the order of their original dataframes. Obviously the solution would be to do df1.union(df2.select(df1.columns))

But the main question is, why does it do this? Is it simply because it's part of pyspark.sql, or is there some underlying data architecture in Spark that I've goofed up in understanding?

code to create test set if anyone wants to try

d1={'a':[1,2,3], 'b':['asd','asda','f1f']}
d2={ 'b':['asd','asda','f1f'], 'a':[1,2,3],}
pdf1=pd.DataFrame(d1)
pdf2=pd.DataFrame(d2)
df1=spark.createDataFrame(pdf1)
df2=spark.createDataFrame(pdf2)
test=df1.union(df2)
like image 371
Charles Du Avatar asked Jul 08 '19 20:07

Charles Du


People also ask

Does column order matter in Spark Union?

I've come across something strange recently in Spark. As far as I understand, given the column based storage method of spark dfs, the order of the columns really don't have any meaning, they're like keys in a dictionary.

How do I order columns in Spark?

In Spark, you can use either sort() or orderBy() function of DataFrame/Dataset to sort by ascending or descending order based on single or multiple columns, you can also do sorting using Spark SQL sorting functions, In this article, I will explain all these different ways using Scala examples.


2 Answers

The Spark union is implemented according to standard SQL and therefore resolves the columns by position. This is also stated by the API documentation:

Return a new DataFrame containing union of rows in this and another frame.

This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does >deduplication of elements), use this function followed by a distinct.

Also as standard in SQL, this function resolves columns by position (not by name).

Since Spark >= 2.3 you can use unionByName to union two dataframes were the column names get resolved.

like image 66
cronoik Avatar answered Oct 23 '22 23:10

cronoik


in spark Union is not done on metadata of columns and data is not shuffled like you would think it would. rather union is done on the column numbers as in, if you are unioning 2 Df's both must have the same numbers of columns..you will have to take in consideration of positions of your columns previous to doing union. unlike SQL or Oracle or other RDBMS, underlying files in spark are physical files. hope that answers your question

like image 13
Aaron Avatar answered Oct 24 '22 00:10

Aaron