
Shuffled vs non-shuffled coalesce in Apache Spark

What is the difference between the following transformations when they are executed right before writing an RDD to a file?

  1. coalesce(1, shuffle = true)
  2. coalesce(1, shuffle = false)

Code example:

val input = sc.textFile(inputFile)
val filtered = input.filter(doSomeFiltering)
val mapped = filtered.map(doSomeMapping)

mapped.coalesce(1, shuffle = true).saveAsTextFile(outputFile)
vs
mapped.coalesce(1, shuffle = false).saveAsTextFile(outputFile)

And how do they compare with collect()? I'm fully aware that Spark save methods will store the output in an HDFS-style directory structure; I'm more interested in the data partitioning aspects of collect() and shuffled/non-shuffled coalesce().

Paweł Jurczenko asked Jun 17 '15 18:06





1 Answer

shuffle = true and shuffle = false make no practical difference to the resulting output, since both reduce the RDD to a single partition. Setting it to true, however, adds a shuffle step that does nothing for the final layout. With shuffle = true the output is evenly distributed among the target partitions (and you're also able to increase the number of partitions if you want), but since your target is 1 partition, everything ends up in that one partition regardless.
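To see this concretely, here is a minimal sketch that coalesces the same RDD both ways and inspects the result. The local master, the app name, the object and helper names, and the parallelize call standing in for sc.textFile(inputFile) are all assumptions made so the snippet is self-contained; they are not from the question. One thing worth checking in the printed lineage: the RDD API documentation notes that a drastic coalesce without a shuffle can make the upstream computation run on fewer nodes, and that passing shuffle = true keeps the upstream partitions executing in parallel, so the two variants can differ in how the work is scheduled even when the output is the same.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceShuffleSketch {

  // Hypothetical helper: returns the partition counts of the shuffled and
  // non-shuffled variants, and prints both lineages for comparison.
  def partitionCounts(sc: SparkContext): (Int, Int) = {
    // Stand-in for sc.textFile(inputFile).filter(...).map(...): 8 input partitions
    val mapped = sc.parallelize(1 to 1000, numSlices = 8)
      .filter(_ % 2 == 0)
      .map(_ * 10)

    val shuffled = mapped.coalesce(1, shuffle = true)
    val narrow   = mapped.coalesce(1, shuffle = false)

    // The shuffled lineage should show a ShuffledRDD (a stage boundary);
    // the non-shuffled one should show only a CoalescedRDD over the parents.
    println(shuffled.toDebugString)
    println(narrow.toDebugString)

    (shuffled.partitions.length, narrow.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    // local[4] is an assumption for running the sketch on one machine
    val conf = new SparkConf().setMaster("local[4]").setAppName("coalesce-sketch")
    val sc = new SparkContext(conf)
    try {
      val (withShuffle, withoutShuffle) = partitionCounts(sc)
      println(s"shuffle = true:  $withShuffle partition(s)")    // 1
      println(s"shuffle = false: $withoutShuffle partition(s)") // 1
    } finally {
      sc.stop()
    }
  }
}
```

Both variants end with one partition, so saveAsTextFile would write a single part file either way; the lineage output is where the two differ.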

As for the comparison with collect(): after coalesce(1), all of the data is held on a single executor, whereas collect() brings it all to the driver.
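The distinction also shows up in the types: coalesce(1) still returns an RDD, while collect() returns a plain local array. A small sketch, assuming a local master and a tiny in-memory dataset in place of the question's input file (the object and helper names are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CollectVsCoalesceSketch {

  // Hypothetical helper: returns (elements per partition after coalesce(1),
  // the array that collect() materializes in the driver).
  def compare(sc: SparkContext): (Array[Int], Array[String]) = {
    val mapped = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 4)
      .map(_.toUpperCase)

    // Still a distributed dataset: one partition, processed by one task
    val single: RDD[String] = mapped.coalesce(1)
    // glom() turns each partition into an array, so this counts rows per partition
    val sizes: Array[Int] = single.glom().map(_.length).collect()

    // collect() pulls every element into the driver JVM as a local array
    val local: Array[String] = mapped.collect()

    (sizes, local)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("collect-vs-coalesce")
    val sc = new SparkContext(conf)
    try {
      val (sizes, local) = compare(sc)
      println(sizes.mkString(","))  // 4
      println(local.mkString(","))  // A,B,C,D
    } finally {
      sc.stop()
    }
  }
}
```

With collect(), the whole dataset has to fit in the driver's memory; with coalesce(1), it is a single task's worth of data on an executor and can still be written out with saveAsTextFile.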

Holden answered Oct 26 '22 05:10
