 

Spark Dataset/Dataframe join NULL skew key

Working with Spark Dataset/DataFrame joins, I ran into long-running jobs that failed with OOM errors.

Here's input:

  • ~10 datasets of different sizes, mostly huge (>1 TB)
  • all left-joined to one base dataset
  • some of the join keys are null

After some analysis, I found that the reason for the failed and slow jobs was a skewed null key: the left side has millions of records whose join key is null.

I came up with a brute-force approach to solve this issue, and I want to share it here.

If you have a better solution, or any built-in one (for regular Apache Spark), please share it.
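For reference, the brute-force idea can be sketched on plain Scala collections (a stand-in for DataFrames): replace each null join key with a unique surrogate value that can never match a real key, so the skewed null rows spread across partitions instead of all hashing to one. The data, names, and surrogate scheme below are illustrative assumptions, not the exact code from the job.

```scala
// Minimal sketch: replace each null join key with a unique negative
// surrogate so skewed null rows spread evenly instead of piling onto
// a single partition. Plain collections stand in for DataFrames.
val rows: List[(Option[Long], String)] =
  List((Some(1L), "a"), (None, "b"), (None, "c"))

// real keys are positive here, so unique negatives can never match them
val salted: List[(Long, String)] = rows.zipWithIndex.map {
  case ((Some(k), v), _) => (k, v)
  case ((None, v), i)    => (-i.toLong - 1, v)
}
// the salted data can now be left-joined as usual; the surrogate keys
// simply find no match and survive as left-only rows
```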

asked Mar 05 '23 by Mikhail Dubkov

1 Answer

I had the same problem a while ago, but after running some performance tests I chose another approach. It depends on your data; the data will tell you which algorithm is better for solving this join problem.

In my case, more than 30% of the data on the left side of the join has a null key, and the data is in Parquet format. Given that, it's better for me to filter the rows where this key is null and where this key is not null, join only the not-null rows, and union both datasets afterwards.

val data = ...

// rows with a null key can never match, so skip the join for them
val notJoinable = data.filter('keyToJoin.isNull)
val joinable = data.filter('keyToJoin.isNotNull)

// note: both sides of the union must have the same schema
joinable.join(...) union notJoinable
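The same split-join-union pattern can be sketched as a small runnable example on plain Scala collections (a stand-in for the DataFrame version above; the data and names are illustrative):

```scala
// Split on key nullability, join only the non-null keys, then union the
// null-key rows back with a None right side so both halves line up.
val baseRows: List[(Option[Int], String)] =
  List((Some(1), "a"), (None, "b"), (Some(2), "c"))
val lookup: Map[Int, String] = Map(1 -> "x", 2 -> "y")

// the analogue of the isNull / isNotNull filters
val (joinable, notJoinable) = baseRows.partition(_._1.isDefined)

// left-join the non-null keys against the lookup side
val joined = joinable.map { case (k, v) => (k, v, k.flatMap(lookup.get)) }
// union the null-key rows back; every left row survives exactly once
val result = joined ++ notJoinable.map { case (k, v) => (k, v, None) }
```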

It avoids the hotspot too. If I used your approach (negative numbers, or whatever non-"joinable" value), Spark would shuffle all of that data, which is a lot (more than 30%).

Just trying to show you another approach to your problem.

answered Mar 29 '23 by Henrique dos Santos Goulart