Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Comparing two data frames in Spark (performance)

I need to compare two dataframes in my spark application. I went through the following post. How to obtain the difference between two DataFrames?

However, I don't understand why the approach in the best answer

df1.unionAll(df2).except(df1.intersect(df2))

is better than the one in the question

df1.except(df2).union(df2.except(df1))

Can anyone explain? As per my understanding, the latter works with two smaller datasets and former works with a large dataset. Is it because the latter does a distinct as a part of union? Even then, if it is more likely case that two data frames have same records, we are dealing with a small dataset in the latter case.

like image 619
Ajay Vepakomma Avatar asked Jan 07 '19 15:01

Ajay Vepakomma


Video Answer


1 Answers

Let's consider a scenario where both df1 and df2 (of size N and M respectively) are too large to be broadcasted, but there is no overlap between df1 and df2.

Let's call it the result di. In such case df1.intersect(df2) will require a full shuffle of N + M rows, however the size of the output will be equal to 0. In such case df1.unionAll(df2).except(di) can be executed as a broadcast join (such optimization might require adaptive execution unless specific plan is forced by the user). It is also important to note that such plan doesn't require caching.

In contrast the cost of df1.except(df2).union(df2.except(df1)) will be constant in regards to the cardinality of the intersection.

At the same time, if d1 is to large to be broadcasted, it already has a partitioning compatible with except, so the remaining query shouldn't require additional shuffle.

like image 100
10465355 Avatar answered Oct 25 '22 22:10

10465355