Does groupByKey in Spark preserve the original order?

Tags: scala, apache-spark

In Spark, the groupByKey function transforms a (K,V) pair RDD into a (K,Iterable<V>) pair RDD.

But is this function stable, i.e. does the iterable keep the values in their original order?

For example, if I originally read a file of the form:

K1;V11
K2;V21
K1;V12

Can my iterable for K1 be (V12, V11), which would not preserve the original order, or can it only be (V11, V12), which would?


asked Jun 13 '14 13:06

1 Answer

No, the order is not preserved. Example in spark-shell:

scala> sc.parallelize(Seq(0->1, 0->2), 2).groupByKey.collect
res0: Array[(Int, Iterable[Int])] = Array((0,ArrayBuffer(2, 1)))

The order is timing dependent, so it can vary between runs. (I got the opposite order on my next run.)

What is happening here? groupByKey works by repartitioning the RDD with a HashPartitioner, so that all values for a key end up in the same partition. Then it performs the aggregation locally on each partition.
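To make the partitioning step concrete, here is a minimal plain-Scala sketch of how a hash partitioner can pick a target partition: the key's hashCode modulo the number of partitions, kept non-negative. The object and function names (`HashPartitionSketch`, `partitionFor`) are made up for illustration. This is a simplified stand-in, not Spark's actual source.

```scala
object HashPartitionSketch {
  // Modulus that never returns a negative number, since hashCode may be negative.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Hypothetical stand-in for a hash partitioner: the target partition
  // depends only on the key, never on the value or on arrival order.
  def partitionFor(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    val records = Seq("K1" -> "V11", "K2" -> "V21", "K1" -> "V12")
    records.foreach { case (k, v) =>
      println(s"$k;$v -> partition ${partitionFor(k, 2)}")
    }
  }
}
```

Both K1 records always map to the same partition number. The sketch says nothing about the order in which they arrive there, which is the part that varies.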

The repartitioning is also called a "shuffle", because the records of the RDD are redistributed between nodes. The shuffle files are pulled from the other nodes in parallel. The new partition is built from these pieces in the order that they arrive. The data from the slowest source will be at the end of the new partition, and therefore at the end of the value list that groupByKey produces.

(Data pulled from the worker itself is of course the fastest. Since no network transfer is involved, this data seems to be pulled synchronously, and thus arrives in order. So to replicate my experiment you need at least 2 Spark workers.)
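If you need the original order, one possible workaround is to tag each record with its position before the shuffle and sort each group by that tag afterwards. The sketch below is an illustration under assumptions, not part of the original answer: it assumes a Spark version where pair RDD functions are available without importing `SparkContext._`, it runs with a `local[2]` master, and the names `OrderedGroupByKey` and `groupInInputOrder` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object OrderedGroupByKey {
  // Hypothetical helper: group values by key, then restore input order
  // within each group by sorting on the index attached before the shuffle.
  def groupInInputOrder(pairs: RDD[(String, String)]): RDD[(String, List[String])] =
    pairs
      .zipWithIndex()                               // ((key, value), position)
      .map { case ((k, v), idx) => (k, (idx, v)) }  // carry the position with the value
      .groupByKey()                                 // arrival order is not guaranteed here
      .mapValues(_.toList.sortBy(_._1).map(_._2))   // sort by position, then drop it

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ordered-group-by-key").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq("K1" -> "V11", "K2" -> "V21", "K1" -> "V12"), 2)
      groupInInputOrder(pairs).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With this approach the values for K1 come out as V11 followed by V12 regardless of shuffle timing. The cost is an extra Spark job for zipWithIndex when the RDD has more than one partition, plus an in-memory sort of every group.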

Source: http://apache-spark-user-list.1001560.n3.nabble.com/Is-shuffle-quot-stable-quot-td7628.html


answered Sep 18 '22 21:09

Daniel Darabos

Jean Logeart
