 

Spark - repartition() vs coalesce()

According to Learning Spark

Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.

One difference I get is that with repartition() the number of partitions can be increased/decreased, but with coalesce() the number of partitions can only be decreased.

If the partitions are spread across multiple machines and coalesce() is run, how can it avoid data movement?

asked Jul 24 '15 by Praveen Sripati


People also ask

Is repartition faster than coalesce?

repartition redistributes the data evenly, but at the cost of a full shuffle. coalesce is usually faster when you reduce the number of partitions because it stitches input partitions together instead of moving every record; the trade-off is that coalesce doesn't guarantee a uniform data distribution. When you increase the number of partitions, you need repartition instead (under the hood, the RDD API's repartition is just coalesce with shuffling enabled).
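The contrast can be sketched in plain Python (illustrative only; `hash_repartition` and `stitch_coalesce` are made-up helper names, not Spark APIs): repartition sends every record to a partition chosen by hash, while coalesce concatenates whole input partitions.

```python
# Toy contrast between a full shuffle and partition stitching.
# These helpers are stand-ins for intuition, not real Spark APIs.

def hash_repartition(partitions, n):
    """Every record is reassigned by hash -- all data may move (a shuffle)."""
    out = [[] for _ in range(n)]
    for part in partitions:
        for record in part:
            out[hash(record) % n].append(record)
    return out

def stitch_coalesce(partitions, n):
    """Whole input partitions are stitched together -- records never split."""
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i * n // len(partitions)].extend(part)
    return out

parts = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
# repartition spreads records evenly, but touches every record:
print(hash_repartition(parts, 2))
# coalesce keeps each input partition intact, so the result can be skewed:
print(stitch_coalesce(parts, 2))
```

Note that in the stitched version each original partition survives whole inside one output partition, which is exactly why no per-record shuffle is needed.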

In what scenario we will use coalesce and repartition?

repartition(1) and coalesce(1) can be used to write out DataFrames to single files. You won't typically want to write out data to a single file because it's slow (and will error out if the dataset is big). You'll only want to write out data to a single file when the DataFrame is tiny.

Can coalesce increase partitions in spark?

coalesce reduces the number of partitions in a DataFrame. It avoids a complete shuffle: rather than creating new partitions and redistributing every record, it merges data into a subset of the existing partitions.

Why coalesce is used in spark?

coalesce is a PySpark method for working with the partitioning of a DataFrame. It decreases the number of partitions while avoiding a full shuffle of the data.


1 Answer

It avoids a full shuffle. If Spark knows that the number of partitions is only decreasing, each executor can safely keep its data in place on the minimum number of partitions, moving data only off the extra nodes and onto the nodes that are kept.

So, it would go something like this:

Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12

Then coalesce down to 2 partitions:

Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)

Notice that Node 1 and Node 3 did not need to move their original data.
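The scenario above can be sketched in plain Python (a toy model with a made-up `coalesce_nodes` helper; real Spark's partition coalescer also weighs data locality, so the exact pairing of kept and moved partitions may differ from this illustration):

```python
# Toy sketch of the 4 -> 2 coalesce above: whole partitions are merged,
# so the partitions that survive never move their original records.

nodes = {
    "Node 1": [1, 2, 3],
    "Node 2": [4, 5, 6],
    "Node 3": [7, 8, 9],
    "Node 4": [10, 11, 12],
}

def coalesce_nodes(nodes, keep):
    """Keep the named nodes; move each remaining partition, whole, onto
    one of the kept nodes (round-robin). Kept nodes' data never moves."""
    kept = {name: list(nodes[name]) for name in keep}
    movers = [name for name in nodes if name not in keep]
    for i, name in enumerate(movers):
        target = keep[i % len(keep)]
        kept[target] = kept[target] + nodes[name]  # whole partition moves
    return kept

result = coalesce_nodes(nodes, keep=["Node 1", "Node 3"])
for name, data in result.items():
    print(name, data)
```

Only the two discarded partitions travel over the network; the kept nodes' original records stay where they already are.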

answered Oct 17 '22 by Justin Pihony