 

Spark - repartition() vs coalesce()

According to Learning Spark

Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.

One difference I get is that with repartition() the number of partitions can be increased/decreased, but with coalesce() the number of partitions can only be decreased.

If the partitions are spread across multiple machines and coalesce() is run, how can it avoid data movement?

asked Jul 24 '15 by Praveen Sripati


People also ask

Is repartition faster than coalesce?

repartition redistributes the data evenly, but at the cost of a full shuffle. coalesce is usually faster when you reduce the number of partitions because it stitches input partitions together instead of moving every record; the trade-off is that coalesce doesn't guarantee a uniform data distribution. When you increase the number of partitions, you need repartition instead (under the hood, the RDD API's repartition is just coalesce with shuffling enabled).
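The contrast can be sketched in plain Python (illustrative only; `hash_repartition` and `stitch_coalesce` are made-up helper names, not Spark APIs): repartition sends every record to a partition chosen by hash, while coalesce concatenates whole input partitions.

```python
# Toy contrast between a full shuffle and partition stitching.
# These helpers are stand-ins for intuition, not real Spark APIs.

def hash_repartition(partitions, n):
    """Every record is reassigned by hash -- all data may move (a shuffle)."""
    out = [[] for _ in range(n)]
    for part in partitions:
        for record in part:
            out[hash(record) % n].append(record)
    return out

def stitch_coalesce(partitions, n):
    """Whole input partitions are stitched together -- records never split."""
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i * n // len(partitions)].extend(part)
    return out

parts = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
# repartition spreads records evenly, but touches every record:
print(hash_repartition(parts, 2))
# coalesce keeps each input partition intact, so the result can be skewed:
print(stitch_coalesce(parts, 2))
```

Note that in the stitched version each original partition survives whole inside one output partition, which is exactly why no per-record shuffle is needed.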

In what scenario we will use coalesce and repartition?

repartition(1) and coalesce(1) can be used to write out DataFrames to single files. You won't typically want to write out data to a single file because it's slow (and will error out if the dataset is big). You'll only want to write out data to a single file when the DataFrame is tiny.

Can coalesce increase partitions in spark?

coalesce reduces the number of partitions in a DataFrame. It avoids a complete shuffle: rather than creating new partitions and redistributing every record, it merges data into a subset of the existing partitions.

Why coalesce is used in spark?

coalesce is a PySpark method for working with the partitioning of a DataFrame. It decreases the number of partitions while avoiding a full shuffle of the data.


1 Answer

It avoids a full shuffle. If Spark knows that the number of partitions is only decreasing, each executor can safely keep its data in place on the minimum number of partitions, moving data only off the extra nodes and onto the nodes that are kept.

So, it would go something like this:

Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12

Then coalesce down to 2 partitions:

Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)

Notice that Node 1 and Node 3 did not need to move their original data.
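The scenario above can be sketched in plain Python (a toy model with a made-up `coalesce_nodes` helper; real Spark's partition coalescer also weighs data locality, so the exact pairing of kept and moved partitions may differ from this illustration):

```python
# Toy sketch of the 4 -> 2 coalesce above: whole partitions are merged,
# so the partitions that survive never move their original records.

nodes = {
    "Node 1": [1, 2, 3],
    "Node 2": [4, 5, 6],
    "Node 3": [7, 8, 9],
    "Node 4": [10, 11, 12],
}

def coalesce_nodes(nodes, keep):
    """Keep the named nodes; move each remaining partition, whole, onto
    one of the kept nodes (round-robin). Kept nodes' data never moves."""
    kept = {name: list(nodes[name]) for name in keep}
    movers = [name for name in nodes if name not in keep]
    for i, name in enumerate(movers):
        target = keep[i % len(keep)]
        kept[target] = kept[target] + nodes[name]  # whole partition moves
    return kept

result = coalesce_nodes(nodes, keep=["Node 1", "Node 3"])
for name, data in result.items():
    print(name, data)
```

Only the two discarded partitions travel over the network; the kept nodes' original records stay where they already are.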

answered Oct 17 '22 by Justin Pihony