I am using the Java connector for Spark and would like to union two DataFrames, but bizarrely the DataFrame class has only unionAll. Is this intentional, and is there a way to union two DataFrames without duplicates?
Is this intentional?
I think it is safe to assume that it is intentional. Other union operators like RDD.union and DataSet.union keep duplicates as well.
If you think about it, this makes sense. An operation equivalent to UNION ALL is just a logical operation that requires no data access or network traffic, whereas finding distinct elements requires a shuffle and can therefore be quite expensive.
is there a way to union two DataFrames without duplicates?
df1.unionAll(df2).distinct()
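
To make the difference concrete, here is a minimal sketch in plain Java (no Spark dependency) that models the two steps of df1.unionAll(df2).distinct() on ordinary lists. The class name UnionSemantics and its helper methods are hypothetical and exist only for illustration; they are not part of the Spark API, and the in-memory hash set merely stands in for the shuffle Spark performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration only: models the semantics, not Spark's implementation.
class UnionSemantics {
    // Models UNION ALL: append one input to the other, keeping every duplicate.
    // No row has to be inspected or compared.
    public static <T> List<T> unionAll(List<T> left, List<T> right) {
        List<T> out = new ArrayList<>(left);
        out.addAll(right);
        return out;
    }

    // Models DISTINCT: equal rows must be brought together to be collapsed.
    // Here a hash set does that in memory; on a cluster it requires a shuffle.
    public static <T> List<T> distinct(List<T> rows) {
        return new ArrayList<>(new LinkedHashSet<>(rows));
    }

    public static void main(String[] args) {
        List<Integer> df1 = Arrays.asList(1, 2, 2);
        List<Integer> df2 = Arrays.asList(2, 3);

        System.out.println(unionAll(df1, df2));           // prints [1, 2, 2, 2, 3]
        System.out.println(distinct(unionAll(df1, df2))); // prints [1, 2, 3]
    }
}
```

Note that distinct also collapses duplicates that were already present within a single input, which matches the behaviour of SQL UNION.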