
Where is the union() method on the Spark DataFrame class?

I am using the Java connector for Spark and would like to union two DataFrames, but bizarrely the DataFrame class has only unionAll. Is this intentional, and is there a way to union two DataFrames without duplicates?

Milen Kovachev asked Jan 25 '16 12:01


1 Answer

Is this intentional?

I think it is safe to assume that it is intentional. Other union operators, like RDD.union and Dataset.union, keep duplicates as well.

If you think about it, it makes sense. An operation equivalent to UNION ALL is just a logical operation that requires no data access or network traffic, whereas finding distinct elements requires a shuffle and can therefore be quite expensive.

is there a way to union two DataFrames without duplicates?

df1.unionAll(df2).distinct()
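
To make the difference concrete, here is a minimal plain-Java sketch of the two behaviours. It does not use Spark at all: ordinary lists stand in for DataFrames, and the class and method names (UnionSemantics, unionAll, unionDistinct) are hypothetical, chosen only for illustration. It shows which rows survive, not how Spark executes the plan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Plain-Java model of the semantics only; lists stand in for DataFrames.
class UnionSemantics {

    // Analogue of df1.unionAll(df2): concatenate and keep every duplicate.
    static <T> List<T> unionAll(List<T> left, List<T> right) {
        List<T> out = new ArrayList<>(left);
        out.addAll(right);
        return out;
    }

    // Analogue of df1.unionAll(df2).distinct(): drop repeated rows.
    // LinkedHashSet keeps first-seen order in this model; Spark makes
    // no ordering promise after distinct().
    static <T> List<T> unionDistinct(List<T> left, List<T> right) {
        return new ArrayList<>(new LinkedHashSet<>(unionAll(left, right)));
    }

    public static void main(String[] args) {
        List<String> df1 = Arrays.asList("a", "b", "b");
        List<String> df2 = Arrays.asList("b", "c");
        System.out.println(unionAll(df1, df2));      // [a, b, b, b, c]
        System.out.println(unionDistinct(df1, df2)); // [a, b, c]
    }
}
```

Note that the distinct step also collapses duplicates that were already present inside a single input (the two "b" rows in df1 above), not only the rows the two inputs have in common.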
zero323 answered Oct 09 '22 23:10