Is there an effective partitioning method when using reduceByKey in Spark?

When I use reduceByKey or aggregateByKey, I run into partitioning problems.

For example: reduceByKey(_+_).map(code)

In particular, if the input data is skewed, the partitioning problem gets even worse with these methods.

So, as a workaround, I use the repartition method.

For example, http://dev.sortable.com/spark-repartition/ describes a similar approach.

This gives a good partition distribution, but repartition is itself expensive.
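For reference, this is roughly what I do now (a simplified spark-shell sketch; the skewed toy data and the partition count 8 are just placeholders for my real job):

// ~90% of the records share key 0, so the input is skewed
val rdd = sc.parallelize(1 to 1000000).map(x => (x % 10 / 9, 1L))

// current workaround: repartition first, then reduce
val result = rdd
  .repartition(8)        // extra shuffle just to spread the data evenly
  .reduceByKey(_ + _)    // second shuffle for the aggregation itself
  .map { case (k, v) => s"$k:$v" }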

Is there a smarter way to solve the partitioning problem?

asked Mar 26 '17 by S.Kang


People also ask

How does reduceByKey work in Spark?

In Spark, the reduceByKey function is a frequently used transformation that aggregates data. It takes key-value pairs (K, V) as input, aggregates the values by key, and produces a dataset of (K, V) pairs as output.
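For example, a minimal spark-shell sketch (not taken from the question):

// sums the values for each key
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))
pairs.reduceByKey(_ + _).collect()   // Array((a,4), (b,6)) (order may vary)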

Can we use reduceByKey in Spark Dataframe?

A resilient distributed dataset (RDD) is created from a list by parallelizing it with Spark; the reduceByKey() function is then applied to aggregate it, and the output is displayed.

What is the difference between reduceByKey and groupByKey?

Both reduceByKey and groupByKey are wide transformations, which means both trigger a shuffle operation. The key difference is that reduceByKey does a map-side combine and groupByKey does not.
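A small sketch of the two variants over the same data (both give the same sums, but groupByKey ships every raw value across the network before summing):

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// map-side combine: partial sums are computed before the shuffle
pairs.reduceByKey(_ + _).collect()             // Array((a,3), (b,3))

// no map-side combine: all raw values are shuffled, then summed
pairs.groupByKey().mapValues(_.sum).collect()  // Array((a,3), (b,3))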

How does reduceByKey work in Pyspark?

Merge the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce.


2 Answers

You are right:

Repartition is really expensive to run, because of the shuffle and the other steps it involves. Building an example like the one in your question:

rdd.map(x => (x, x * x)).repartition(8).reduceByKey(_+_)

See the DAG here:

[Screenshot: DAG with map, repartition, and reduceByKey stages]

This creates a DAG with one map, one repartition, and one reduce.
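You can also see the extra shuffle without opening the UI, for example by printing the RDD lineage with toDebugString (a quick spark-shell check, not part of the original answer; the toy rdd below stands in for your real data):

val rdd = sc.parallelize(1 to 100)

// the lineage should list two ShuffledRDDs: one for repartition, one for reduceByKey
println(rdd.map(x => (x, x * x)).repartition(8).reduceByKey(_ + _).toDebugString)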

But if you do the repartitioning inside the reduceByKey, you get the repartition for "free".

The main cost of repartition is the shuffle, and the main cost of reduceByKey is the shuffle too. That is why, in the Scala API, reduceByKey has a numPartitions parameter.

So you can change your code to this:

rdd.map(x => (x, x * x)).reduceByKey(_+_, 8)

[Screenshot: DAG with only the map and reduceByKey stages]

And you can see that the same code, with the repartitioning folded into reduceByKey, is much faster, because you have one less shuffle to do.
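The same lineage check on the combined version shows only one shuffle, and reduceByKey also accepts an explicit Partitioner if you need more control than just a partition count (again a sketch, reusing the toy rdd from above):

// only one ShuffledRDD in the lineage: reduceByKey handles the partitioning itself
println(rdd.map(x => (x, x * x)).reduceByKey(_ + _, 8).toDebugString)

// equivalent, with the partitioner spelled out explicitly
rdd.map(x => (x, x * x)).reduceByKey(new org.apache.spark.HashPartitioner(8), _ + _)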


answered Nov 15 '22 by Thiago Baldim


You have to distinguish between two different problems:

Data skew

If the data distribution is highly skewed (let's assume the worst-case scenario with only a single unique key), then by definition the output will be skewed and changing the partitioner cannot help you.

There are some techniques that can be used to partially address the problem (one common approach is sketched below), but overall partitioning is not the core issue here.
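For completeness, one technique that is often suggested for skewed aggregations is a two-phase ("salted") reduce; this is only a sketch, not part of the original answer, and skewedRdd and the bucket count are placeholders:

import scala.util.Random

// phase 1: add a random salt to the key so a hot key is spread over several reducers
val saltBuckets = 10
val partial = skewedRdd                                      // assumed RDD[(String, Long)]
  .map { case (k, v) => ((k, Random.nextInt(saltBuckets)), v) }
  .reduceByKey(_ + _)

// phase 2: drop the salt and reduce the (at most saltBuckets) partial sums per key
val totals = partial
  .map { case ((k, _), v) => (k, v) }
  .reduceByKey(_ + _)

This only works because the reduce function is associative and commutative, which reduceByKey requires anyway.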

Partitioner bias

A poorly chosen partitioning function can result in a skewed data distribution even if the keys themselves are uniformly distributed. For example:

val rdd = sc.parallelize(Seq((5, None), (10, None), (15, None), (20, None)), 5)
rdd
  .partitionBy(new org.apache.spark.HashPartitioner(5))
  .glom.map(_.size).collect
Array[Int] = Array(4, 0, 0, 0, 0)

As you can see, even though the key distribution is not skewed, skew has been induced by regularities in the data and the poor properties of hashCode.
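You can see the mechanism directly: an Int's hashCode is the value itself, and every key in the example is a multiple of 5, so HashPartitioner(5) sends them all to partition 0 (a quick illustration, not part of the original answer):

// HashPartitioner assigns a key to partition hashCode % numPartitions (made non-negative)
Seq(5, 10, 15, 20).map(k => k.hashCode % 5)   // List(0, 0, 0, 0)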

In a case like this, choosing a different Partitioner:

rdd
  .partitionBy(new org.apache.spark.RangePartitioner(5, rdd))
  .glom.map(_.size).collect
Array[Int] = Array(1, 1, 1, 1, 0)

or adjusting properties of the existing one:

rdd
  .partitionBy(new org.apache.spark.HashPartitioner(7))
  .glom.map(_.size).collect
Array[Int] = Array(0, 1, 0, 1, 0, 1, 1)

can resolve the issue.

answered Nov 15 '22 by zero323