I need to compare two DataFrames in my Spark application. I went through the following post: How to obtain the difference between two DataFrames?
However, I don't understand why the approach in the best answer
df1.unionAll(df2).except(df1.intersect(df2))
is better than the one in the question
df1.except(df2).union(df2.except(df1))
Can anyone explain? As I understand it, the latter operates on two smaller intermediate datasets, while the former works with one large dataset. Is it because the latter does a distinct as part of the union? Even then, in the likely case that the two DataFrames share most of their records, we are dealing with small datasets in the latter case.
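Before comparing costs, it may help to confirm that the two expressions are logically equivalent. Here is a minimal sketch using plain Python sets with hypothetical data (Spark's intersect and except both return distinct rows, so set semantics are a fair approximation): both expressions compute the symmetric difference.

```python
# Hypothetical data, plain Python sets standing in for distinct rows.
df1 = {1, 2, 3, 4}
df2 = {3, 4, 5, 6}

# Accepted answer's shape: union everything, then subtract the intersection.
approach_a = (df1 | df2) - (df1 & df2)

# Question's shape: union of the two one-sided differences.
approach_b = (df1 - df2) | (df2 - df1)

print(approach_a)  # {1, 2, 5, 6}
print(approach_b)  # {1, 2, 5, 6}
```

Both produce the symmetric difference, so the question is purely about execution cost, not correctness (duplicate rows aside).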
Let's consider a scenario where both df1 and df2 (of size N and M respectively) are too large to be broadcast, but there is no overlap between df1 and df2. In such a case df1.intersect(df2) will require a full shuffle of N + M rows, but the size of its output will be 0. Let's call that result di. Then df1.unionAll(df2).except(di) can be executed as a broadcast join (such an optimization might require adaptive execution, unless a specific plan is forced by the user). It is also important to note that such a plan doesn't require caching.
In contrast, the cost of df1.except(df2).union(df2.except(df1)) is constant with respect to the cardinality of the intersection: both except operations have to shuffle all of df1 and df2, regardless of how much the two overlap.
At the same time, if di is too large to be broadcast, it already has a partitioning compatible with except, so the remaining query shouldn't require an additional shuffle.
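A back-of-the-envelope sketch of the cost argument above, in plain Python with hypothetical sizes and a hypothetical broadcast threshold: in the no-overlap scenario, di is empty and therefore broadcastable, while the alternative plan shuffles every row for each of its two except operations.

```python
# Hypothetical row counts, just to illustrate the shuffle-volume argument.
N, M = 1_000_000, 800_000   # rows in df1 and df2
overlap = 0                 # scenario from the answer: no common rows

# Accepted answer: intersect shuffles everything once, producing di.
di_size = overlap
shuffled_rows_a = N + M
# di is tiny, so the subsequent except can run as a broadcast join.
BROADCAST_THRESHOLD = 10_000  # hypothetical row limit for broadcasting
fits_broadcast = di_size <= BROADCAST_THRESHOLD

# Question's approach: each one-sided except shuffles both inputs.
shuffled_rows_b = (N + M) + (N + M)

print(fits_broadcast)                    # True
print(shuffled_rows_a, shuffled_rows_b)  # 1800000 3600000
```

The exact numbers are made up; the point is only that the second plan's shuffle volume does not shrink as the overlap shrinks, while the first plan's second stage gets cheap precisely when the overlap is small.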