
Skewed dataset join in Spark?

I am joining two big datasets using Spark RDDs. One dataset is heavily skewed, so a few of the executor tasks take a long time to finish. How can I solve this?

Raj Kumar asked Nov 02 '16 06:11



People also ask

What is skewed join?

A skew join is used when a table has skewed data in the joining column. A skewed table is one in which a few values occur far more often than the rest. The skewed data is stored in one file, while the rest of the data is stored in a separate file.


1 Answer

Here is a pretty good article on how it can be done: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/

Short version:

  • Add a random element to each key of the large RDD and use the combination as the new join key
  • Replicate each entry of the small RDD with explode/flatMap, one copy per possible random value, and build the same new join key
  • Join the RDDs on the new join key, which is now distributed more evenly because of the random element
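
The steps above can be sketched in Python roughly as follows. This is only an illustration, not the code from the linked article: the function names (`salt_large`, `replicate_small`, `unsalt`, `local_join`) and the replication factor `NUM_SALTS` are made up for this example, and the RDDs are assumed to hold `(key, value)` pairs. The join itself is simulated on plain lists so the sketch runs without a cluster; the comment shows how the same functions would plug into `map`, `flatMap` and `join` on PySpark RDDs.

```python
import random

NUM_SALTS = 8  # assumed replication factor; tune it to the observed skew


def salt_large(record, num_salts=NUM_SALTS):
    """(key, value) -> ((key, salt), value), with a random salt per record."""
    key, value = record
    return ((key, random.randrange(num_salts)), value)


def replicate_small(record, num_salts=NUM_SALTS):
    """(key, value) -> one ((key, salt), value) copy for every possible salt."""
    key, value = record
    return [((key, salt), value) for salt in range(num_salts)]


def unsalt(record):
    """((key, salt), (left, right)) -> (key, (left, right))."""
    (key, _salt), pair = record
    return (key, pair)


# With PySpark RDDs of (key, value) pairs this would look roughly like:
#   joined = (large_rdd.map(salt_large)
#                      .join(small_rdd.flatMap(replicate_small))
#                      .map(unsalt))


def local_join(left, right):
    """Plain-Python inner join on key, standing in for RDD.join."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in index.get(key, [])]


# Tiny skewed example: the key "hot" dominates the large side.
large = [("hot", i) for i in range(1000)] + [("cold", 0)]
small = [("hot", "H"), ("cold", "C")]

salted = [salt_large(rec) for rec in large]
replicated = [copy for rec in small for copy in replicate_small(rec)]
joined = [unsalt(rec) for rec in local_join(salted, replicated)]
```

The result contains the same pairs as a direct join on the original key, but the records for the hot key are spread over up to `NUM_SALTS` distinct join keys instead of one. The cost is that the small side grows by a factor of `NUM_SALTS`, so the factor is a trade-off between balance and extra data.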
LiMuBei answered Oct 14 '22 04:10
