 

Spark Dataset/Dataframe join NULL skew key

Working with Spark Dataset/DataFrame joins, I ran into long-running jobs that failed with OOM errors.

Here's input:

  • ~10 datasets of different sizes, mostly huge (>1 TB)
  • all left-joined to one base dataset
  • some of the join keys are null

After some analysis, I found that the reason for the failed and slow jobs was a skewed null key: the left side has millions of records whose join key is null.

I came up with a brute-force approach to solve this issue, and I want to share it here.

If you have a better solution, or any built-in one (for regular Apache Spark), please share it.
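For reference, the brute-force idea can be sketched on plain Scala collections (a stand-in for DataFrames): replace each null join key with a unique surrogate value that can never match a real key, so the skewed null rows spread across partitions instead of all hashing to one. The data, names, and surrogate scheme below are illustrative assumptions, not the exact code from the job.

```scala
// Minimal sketch: replace each null join key with a unique negative
// surrogate so skewed null rows spread evenly instead of piling onto
// a single partition. Plain collections stand in for DataFrames.
val rows: List[(Option[Long], String)] =
  List((Some(1L), "a"), (None, "b"), (None, "c"))

// real keys are positive here, so unique negatives can never match them
val salted: List[(Long, String)] = rows.zipWithIndex.map {
  case ((Some(k), v), _) => (k, v)
  case ((None, v), i)    => (-i.toLong - 1, v)
}
// the salted data can now be left-joined as usual; the surrogate keys
// simply find no match and survive as left-only rows
```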

asked Mar 05 '23 by Mikhail Dubkov

1 Answer

I had the same problem a while ago, but after running some performance tests I chose another approach. It depends on your data; the data will tell you which algorithm is better for solving this join problem.

In my case, more than 30% of the data on the left side of the join has a null key, and the data is in Parquet format. Given that, it's better for me to filter the rows where this key is null and where this key is not null, join only the not-null rows, and union both datasets afterwards.

val data = ...

// rows with a null key can never match, so skip the join for them
val notJoinable = data.filter('keyToJoin.isNull)
val joinable = data.filter('keyToJoin.isNotNull)

// note: both sides of the union must have the same schema
joinable.join(...) union notJoinable
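The same split-join-union pattern can be sketched as a small runnable example on plain Scala collections (a stand-in for the DataFrame version above; the data and names are illustrative):

```scala
// Split on key nullability, join only the non-null keys, then union the
// null-key rows back with a None right side so both halves line up.
val baseRows: List[(Option[Int], String)] =
  List((Some(1), "a"), (None, "b"), (Some(2), "c"))
val lookup: Map[Int, String] = Map(1 -> "x", 2 -> "y")

// the analogue of the isNull / isNotNull filters
val (joinable, notJoinable) = baseRows.partition(_._1.isDefined)

// left-join the non-null keys against the lookup side
val joined = joinable.map { case (k, v) => (k, v, k.flatMap(lookup.get)) }
// union the null-key rows back; every left row survives exactly once
val result = joined ++ notJoinable.map { case (k, v) => (k, v, None) }
```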

It avoids the hotspot too. If I used your approach (negative numbers, or whatever non-"joinable" value), Spark would shuffle all of that data, which is a lot (more than 30%).

Just trying to show you another approach to your problem.

answered Mar 29 '23 by Henrique dos Santos Goulart