Does a flatMap in Spark cause a shuffle?

1 Answer

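Neither map nor flatMap causes a shuffle. Both are narrow transformations: each output partition is computed from a single parent partition, so no data has to move between partitions. The operations that can cause a shuffle are:

Repartition operations:
- repartition
- coalesce (on an RDD, only when called with shuffle = true)

ByKey operations (except for counting):
- groupByKey
- reduceByKey

Join operations:
- cogroup
- join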
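
One way to check this yourself is to inspect an RDD's dependencies. The following is a minimal sketch, assuming Spark core is on the classpath and a local master is acceptable; the object name FlatMapShuffleCheck and the helper shufflesParent are hypothetical names chosen for illustration, not part of Spark.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FlatMapShuffleCheck {
  // Hypothetical helper: true when the RDD's direct dependency
  // on its parent is a shuffle dependency.
  def shufflesParent(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

  def main(args: Array[String]): Unit = {
    // Assumption: local master with two threads, for illustration only.
    val conf = new SparkConf().setMaster("local[2]").setAppName("flatmap-shuffle-check")
    val sc = SparkContext.getOrCreate(conf)

    val lines  = sc.parallelize(Seq("a b", "b c", "c a"), numSlices = 3)
    val words  = lines.flatMap(_.split(" "))                      // narrow transformation
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)  // needs a shuffle

    println(shufflesParent(words))   // false
    println(shufflesParent(counts))  // true
    println(counts.toDebugString)    // lineage, including the ShuffledRDD stage boundary

    sc.stop()
  }
}
```

The same lineage is visible in the Spark UI, where a shuffle shows up as a stage boundary.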

Although the set of elements in each partition of newly shuffled data is deterministic, as is the ordering of the partitions themselves, the ordering of the elements within a partition is not. If you need predictably ordered data after a shuffle, you can use:

- mapPartitions to sort each partition using, for example, .sorted
- repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning
- sortBy to make a globally ordered RDD
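
The three options above can be sketched as follows. This is only an illustration, assuming Spark core on the classpath and a local master; the object name OrderedAfterShuffle and its method names are hypothetical, and HashPartitioner is just one possible partitioner choice.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object OrderedAfterShuffle {
  // Sort the elements inside each partition; no data leaves its partition.
  def sortEachPartition(rdd: RDD[Int]): RDD[Int] =
    rdd.mapPartitions(it => it.toVector.sorted.iterator)

  // Repartition by key and sort by key within each partition in one shuffle.
  def repartitionSorted(pairs: RDD[(Int, String)], numPartitions: Int): RDD[(Int, String)] =
    pairs.repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions))

  // Produce a globally ordered RDD (this shuffles).
  def globallySorted(rdd: RDD[Int]): RDD[Int] =
    rdd.sortBy(x => x)

  def main(args: Array[String]): Unit = {
    // Assumption: local master with two threads, for illustration only.
    val conf = new SparkConf().setMaster("local[2]").setAppName("ordered-after-shuffle")
    val sc = SparkContext.getOrCreate(conf)

    val nums  = sc.parallelize(Seq(5, 3, 8, 1, 9, 2), numSlices = 2)
    val pairs = nums.map(n => (n, s"v$n"))

    // glom() turns each partition into an array so its contents can be printed.
    sortEachPartition(nums).glom().collect().foreach(p => println(p.mkString(",")))
    repartitionSorted(pairs, 2).glom().collect().foreach(p => println(p.mkString(",")))
    println(globallySorted(nums).collect().mkString(","))  // 1,2,3,5,8,9

    sc.stop()
  }
}
```

Only sortBy gives a total order across partitions; the other two guarantee order only within each partition.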

More info here: http://spark.apache.org/docs/latest/programming-guide.html#shuffle-operations

128

answered Sep 20 '22 22:09

JorgeGlezLopez


Tags:

scala

apache-spark

bigdata

pythonic
