Apache Spark Handling Skewed Data

Question

I have two tables that I would like to join. One of them is badly skewed, so my Spark job does not run in parallel: most of the work ends up on a single partition.

I have heard about, read about, and tried to implement salting my keys to spread the data more evenly. The technique shown at 12:45 in https://www.youtube.com/watch?v=WyfHUNnMutg is exactly what I would like to do.

Any help or tips would be appreciated. Thanks!

WestCoastProjects · Accepted Answer

Yes, you should salt the keys of the larger table (by adding a random value to each key) and then replicate the smaller table, in effect cartesian joining it with the salt values, so that every salted key still finds its match.

Here are a couple of resources:

Tresata skew join for RDDs: https://github.com/tresata/spark-skewjoin

Python skew join: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/

The tresata library looks like this:

import com.tresata.spark.skewjoin.Dsl._  // for the implicits
import org.apache.spark.Partitioner.defaultPartitioner  // Spark's default partitioner helper

// skewJoin() method pulled in by the implicits
rdd1.skewJoin(rdd2, defaultPartitioner(rdd1, rdd2),
  DefaultSkewReplication(1)).sortByKey(true).collect.toList
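
If you would rather not pull in a library, the salting idea from the top of this answer can be sketched with the plain DataFrame API. This is only a minimal sketch under assumptions: the helper name `saltedJoin`, the column names `key`, `value`, `label` and `salt`, the toy data, and the choice of 8 salts are all hypothetical and not from the question. It also assumes the smaller table is cheap to replicate `numSalts` times.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, explode, lit, rand}

object SaltedJoinSketch {
  // Hypothetical helper: inner-joins a skewed `large` DataFrame to `small` on column `key`.
  def saltedJoin(large: DataFrame, small: DataFrame, key: String, numSalts: Int): DataFrame = {
    // Give every row of the skewed table a random salt in [0, numSalts).
    val saltedLarge = large.withColumn("salt", (rand() * numSalts).cast("int"))

    // Replicate every row of the small table once per possible salt value.
    val allSalts = array((0 until numSalts).map(i => lit(i)): _*)
    val saltedSmall = small.withColumn("salt", explode(allSalts))

    // Join on (key, salt): rows of one hot key are now spread over up to numSalts groups.
    saltedLarge.join(saltedSmall, Seq(key, "salt")).drop("salt")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("salted-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data (assumed): key "a" is heavily over-represented in the large table.
    val large = (Seq.fill(1000)("a") ++ Seq("b", "c")).zipWithIndex.toDF("key", "value")
    val small = Seq(("a", "A"), ("b", "B"), ("c", "C")).toDF("key", "label")

    val joined = saltedJoin(large, small, "key", numSalts = 8)

    // Salting should change only how the work is distributed, not the join result.
    assert(joined.count() == large.join(small, Seq("key")).count())

    spark.stop()
  }
}
```

The number of salts is a trade-off: more salts split a hot key across more tasks, but the smaller table grows by the same factor, so it is worth starting small and increasing only while the skew is still visible.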

Tags:

scala

apache-spark

hadoop

spark-dataframe

John Engelhart
