Shuffled vs non-shuffled coalesce in Apache Spark

Tags: scala, distributed-computing, apache-spark

What is the difference between the following transformations when they are executed right before writing an RDD to a file?

coalesce(1, shuffle = true)
coalesce(1, shuffle = false)

Code example:

val input = sc.textFile(inputFile)
val filtered = input.filter(doSomeFiltering)
val mapped = filtered.map(doSomeMapping)

mapped.coalesce(1, shuffle = true).saveAsTextFile(outputFile)
vs
mapped.coalesce(1, shuffle = false).saveAsTextFile(outputFile)

And how does it compare with collect()? I'm fully aware that Spark save methods will store it in an HDFS-style structure; however, I'm more interested in the data partitioning aspects of collect() and shuffled/non-shuffled coalesce().

575

asked Jun 17 '15 18:06

Paweł Jurczenko

1 Answer

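shuffle=true and shuffle=false aren't going to have any practical differences in the resulting output, since both go down to a single partition. The difference is in how the job runs. With shuffle=false, coalesce(1) is a narrow transformation, so the upstream filter and map are pulled into the same single task and run without parallelism. With shuffle=true, Spark adds a shuffle step: the upstream work still runs in parallel on the original partitions, and the results are then sent over the network to the single target partition. In general, shuffle=true distributes the output evenly amongst the target partitions (and you're also able to increase the number of partitions if you want), but since your target is 1 partition, everything ends up in one partition regardless. So the shuffle only pays off when the upstream computation is expensive enough to be worth parallelizing; otherwise it is just extra cost.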
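One way to check what each variant does is to print the partition count and the lineage of both RDDs. The following is only a minimal sketch: the object name, the local[4] master, the 8 input partitions and the toy filter/map functions are assumptions made for illustration, not part of the original question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceDemo {
  def main(args: Array[String]): Unit = {
    // Local master and app name are arbitrary choices for this sketch.
    val conf = new SparkConf().setAppName("coalesce-demo").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the question's input/filter/map pipeline, with 8 partitions.
      val mapped = sc.parallelize(1 to 1000, numSlices = 8)
        .filter(_ % 2 == 0)
        .map(_ * 10)

      val narrow = mapped.coalesce(1, shuffle = false)
      val shuffled = mapped.coalesce(1, shuffle = true)

      // Both variants end up with exactly one partition.
      println(narrow.partitions.length)
      println(shuffled.partitions.length)

      // The lineage differs: the shuffled variant contains a shuffle stage boundary,
      // while the non-shuffled one stays within a single stage.
      println(narrow.toDebugString)
      println(shuffled.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

In the lineage printed for the shuffled variant, the filter and map steps appear below a shuffle boundary and keep their original 8 partitions; in the non-shuffled variant there is no such boundary, so the whole chain runs as one task.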

As for the comparison with collect(), the difference is that with coalesce(1) all of the data ends up on a single executor, whereas collect() brings all of it to the driver.
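The distinction can be illustrated with a small sketch. The object name, the local[2] master and the sample data are assumptions for illustration only; note also that in local mode the driver and the executor share one JVM, so the physical separation only shows up on a real cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CollectVsCoalesce {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("collect-vs-coalesce").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 4)

      // collect() is an action: it returns a plain local Array held in driver memory.
      val onDriver: Array[String] = rdd.collect()
      println(onDriver.length)

      // coalesce(1) is a transformation: the result is still an RDD whose single
      // partition is computed by one task on an executor.
      val single = rdd.coalesce(1)
      println(single.partitions.length)

      // glom() turns each partition into an array, which makes it easy to see
      // how many elements live in each partition.
      val sizes = single.glom().map(_.length).collect()
      println(sizes.mkString(","))
    } finally {
      sc.stop()
    }
  }
}
```

Because the coalesced result is still an RDD, it can be written out with saveAsTextFile directly from the executor, without the data ever having to fit in driver memory.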

168

answered Oct 26 '22 05:10

Holden
