I need to compare two DataFrames in my Spark application. I went through the following post: How to obtain the difference between two DataFrames?
However, I don't understand why the approach in the best answer
df1.unionAll(df2).except(df1.intersect(df2))
is better than the one in the question
df1.except(df2).union(df2.except(df1))
Can anyone explain? As I understand it, the latter operates on two smaller intermediate datasets, while the former works with one large dataset. Is it because the latter does a distinct as part of the union? Even then, in the likely case that the two DataFrames share most of their records, we are dealing with small datasets in the latter case.
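Before comparing costs, it may help to confirm that the two expressions are logically equivalent. Here is a minimal sketch using plain Python sets with hypothetical data (Spark's intersect and except both return distinct rows, so set semantics are a fair approximation): both expressions compute the symmetric difference.

```python
# Hypothetical data, plain Python sets standing in for distinct rows.
df1 = {1, 2, 3, 4}
df2 = {3, 4, 5, 6}

# Accepted answer's shape: union everything, then subtract the intersection.
approach_a = (df1 | df2) - (df1 & df2)

# Question's shape: union of the two one-sided differences.
approach_b = (df1 - df2) | (df2 - df1)

print(approach_a)  # {1, 2, 5, 6}
print(approach_b)  # {1, 2, 5, 6}
```

Both produce the symmetric difference, so the question is purely about execution cost, not correctness (duplicate rows aside).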
Let's consider a scenario where both df1 and df2 (of size N and M respectively) are too large to be broadcast, but there is no overlap between df1 and df2. In such a case df1.intersect(df2) will require a full shuffle of N + M rows, but the size of its output will be 0. Let's call that result di. Then df1.unionAll(df2).except(di) can be executed as a broadcast join (such an optimization might require adaptive execution, unless a specific plan is forced by the user). It is also important to note that such a plan doesn't require caching.
In contrast, the cost of df1.except(df2).union(df2.except(df1)) is constant with respect to the cardinality of the intersection: both except operations have to shuffle all of df1 and df2, regardless of how much the two overlap.
At the same time, if di is too large to be broadcast, it already has a partitioning compatible with except, so the remaining query shouldn't require an additional shuffle.
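A back-of-the-envelope sketch of the cost argument above, in plain Python with hypothetical sizes and a hypothetical broadcast threshold: in the no-overlap scenario, di is empty and therefore broadcastable, while the alternative plan shuffles every row for each of its two except operations.

```python
# Hypothetical row counts, just to illustrate the shuffle-volume argument.
N, M = 1_000_000, 800_000   # rows in df1 and df2
overlap = 0                 # scenario from the answer: no common rows

# Accepted answer: intersect shuffles everything once, producing di.
di_size = overlap
shuffled_rows_a = N + M
# di is tiny, so the subsequent except can run as a broadcast join.
BROADCAST_THRESHOLD = 10_000  # hypothetical row limit for broadcasting
fits_broadcast = di_size <= BROADCAST_THRESHOLD

# Question's approach: each one-sided except shuffles both inputs.
shuffled_rows_b = (N + M) + (N + M)

print(fits_broadcast)                    # True
print(shuffled_rows_a, shuffled_rows_b)  # 1800000 3600000
```

The exact numbers are made up; the point is only that the second plan's shuffle volume does not shrink as the overlap shrinks, while the first plan's second stage gets cheap precisely when the overlap is small.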