Working with Spark Dataset/DataFrame joins, I faced long-running jobs that failed with OOM errors.
After some analysis, I found that the cause of the failed and slow jobs was a skewed null key: the left side had millions of records with a null join key.
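One quick way to confirm this kind of skew is to count rows per join key and look at the top of the list; if null skew is the problem, the null key dominates. A diagnostic sketch (data and keyToJoin are the names used later in this post):

import org.apache.spark.sql.functions._

// Count rows per join key; a heavily skewed null key will show up
// at the top of this list with a count far above the others.
data.groupBy("keyToJoin")
  .count()
  .orderBy(desc("count"))
  .show(10, truncate = false)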
I came up with a brute-force approach to work around the issue, and I want to share it here (a rough sketch follows below).
If you have a better approach, or know of any built-in solution (for regular Apache Spark), please share it.
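The exact code isn't shown here, but based on the reference to "negative numbers" in the answer below, the workaround presumably replaced each null key with a unique, non-matching value so the skewed rows spread across partitions. A minimal sketch, assuming a numeric keyToJoin column and a hypothetical rightSide DataFrame:

import org.apache.spark.sql.functions._

// Replace every null key with a unique negative surrogate.
// monotonically_increasing_id() is unique per row, so no two
// null-key rows collide on the same surrogate, and the negative
// values never match real keys on the right side.
val salted = data.withColumn(
  "keyToJoin",
  when(col("keyToJoin").isNull, -monotonically_increasing_id() - 1)
    .otherwise(col("keyToJoin"))
)
salted.join(rightSide, Seq("keyToJoin"), "left")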
I had the same problem a while ago, but I chose another approach after running some performance tests. It depends on your data: the data will tell you which algorithm better solves this join problem.
In my case, more than 30% of the data on the left side of the join has a null key, and the data is in Parquet format. Given that, it's better for me to filter the rows where this key is null and where it is not null, join only the non-null rows, and union both sets afterwards:
import spark.implicits._  // for the 'keyToJoin symbol syntax

val data = ...
val notJoinable = data.filter('keyToJoin.isNull)    // rows that can never match
val joinable = data.filter('keyToJoin.isNotNull)    // rows worth shuffling
joinable.join(...) union notJoinable                // join, then add null rows back
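For completeness, here is a self-contained sketch of the whole pattern; the SparkSession setup and the toy data are made up for illustration, and a left join is assumed so the null-key rows can be unioned back with a null right side:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object NullSkewJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("null-skew-join")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy data: the left side has null join keys, the right side does not.
    val left = Seq(
      (Some(1L), "a"), (Some(2L), "b"), (None, "c"), (None, "d")
    ).toDF("keyToJoin", "leftValue")
    val right = Seq((1L, "x"), (2L, "y")).toDF("keyToJoin", "rightValue")

    val notJoinable = left.filter('keyToJoin.isNull)
    val joinable = left.filter('keyToJoin.isNotNull)

    // Join only the rows that can actually match...
    val joined = joinable.join(right, Seq("keyToJoin"), "left")
    // ...then add the null-key rows back, padding the missing right column.
    val result = joined.union(
      notJoinable.withColumn("rightValue", lit(null).cast("string"))
    )

    result.show()
    spark.stop()
  }
}

Only the non-null rows are shuffled for the join; the null-key rows never leave their partitions, since union does not shuffle.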
This avoids the hotspot too. With your approach (negative numbers, or whatever non-joinable value), Spark would still shuffle all of that data, which is a lot (more than 30%).
Just trying to show you another approach to your problem.