I am joining two big datasets using Spark RDDs. One dataset is heavily skewed, so a few of the executor tasks take a long time to finish the job. How can I solve this?
A skew join is used when one table has skewed data in the join column. A skewed table is one in which a few values occur far more often than the rest of the data. The skewed data is stored in one file, while the rest of the data is stored in a separate file.
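One common way to spread a hot key over several tasks is key salting. The sketch below is only an illustration in plain Python, not Spark code and not necessarily the method from the linked article. The names `NUM_SALTS`, `salt_skewed_side`, `replicate_other_side`, `hash_join` and `salted_join` are hypothetical. On real RDDs the three helper steps would correspond roughly to `map`, `flatMap` and `join`.

```python
import random
from collections import defaultdict

NUM_SALTS = 4  # hypothetical replication factor; tune to the degree of skew


def salt_skewed_side(records, num_salts, rng):
    """Attach a random salt to each key so a hot key spreads over num_salts buckets."""
    return [((key, rng.randrange(num_salts)), value) for key, value in records]


def replicate_other_side(records, num_salts):
    """Emit each record once per salt so every salted key can still find its match."""
    return [((key, salt), value) for key, value in records for salt in range(num_salts)]


def hash_join(left, right):
    """Plain inner join on the (key, salt) pair, standing in for an RDD join."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in index.get(key, [])]


def salted_join(skewed, other, num_salts=NUM_SALTS, seed=0):
    rng = random.Random(seed)
    joined = hash_join(
        salt_skewed_side(skewed, num_salts, rng),
        replicate_other_side(other, num_salts),
    )
    # Drop the salt to recover the original key.
    return [(key, pair) for (key, _salt), pair in joined]


# Toy data: "hot" dominates the skewed side.
skewed = [("hot", i) for i in range(8)] + [("cold", 100)]
other = [("hot", "H"), ("cold", "C")]
result = salted_join(skewed, other)
```

The trade-off is that the non-skewed side grows by a factor of `NUM_SALTS`, so this tends to pay off only when that side is much smaller than the skewed one or when only the hot keys are salted.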
Pretty good article on how it can be done: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
Short version: