
Does groupByKey in Spark preserve the original order?

In Spark, the groupByKey function transforms a (K,V) pair RDD into a (K,Iterable<V>) pair RDD.

But is this function stable? That is, is the order of the values in each iterable preserved from the original order?

For example, if I originally read a file of the form:

K1;V11
K2;V21
K1;V12

May my iterable for K1 be like (V12, V11) (thus not preserving the original order) or can it only be (V11, V12) (thus preserving the original order)?
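For reference, the scenario can be sketched like this in Scala (the file name "pairs.txt" and the parsing are hypothetical, not from the question):

```scala
// Hypothetical sketch: read lines like "K1;V11", split them into
// (K, V) pairs, then group by key.
val rdd = sc.textFile("pairs.txt").map { line =>
  val Array(k, v) = line.split(";")
  (k, v)
}
val grouped = rdd.groupByKey()  // RDD[(String, Iterable[String])] -- but in what order?
```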

Asked Jun 13 '14 by Jean Logeart

People also ask

Does GroupByKey preserve order?

No. groupByKey involves a shuffle, so the order of values within each group is not guaranteed.

How does GroupByKey work in spark?

The groupByKey function in Apache Spark is a frequently used transformation that shuffles the data. It receives key-value pairs, or (K, V), as its input, groups the values by key, and generates a dataset of (K, Iterable<V>) pairs as its output.

Does groupBy preserve order Pyspark?

groupBy after orderBy doesn't maintain order, as others have pointed out. What you want to do is use a Window function, partitioned on id and ordered by hours.
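A minimal sketch of that window approach (the column names `id` and `hours` are taken from the quoted answer; `df` is an assumed DataFrame, so this requires a running Spark session):

```scala
// Sketch: collect values per id in hours order using a window function,
// rather than relying on groupBy to preserve any ordering.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

val w = Window.partitionBy("id").orderBy("hours")
val withOrdered = df.withColumn("hours_in_order", collect_list("hours").over(w))
```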

What is the difference between GroupByKey () and reduceByKey () transformations in spark?

reduceByKey aggregates values per key before shuffling, while groupByKey shuffles all the key-value pairs, as the diagrams show. On large datasets the difference is significant.
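The contrast above can be sketched with a per-key sum written both ways (the sample data is hypothetical):

```scala
// Sketch: the same per-key sum written both ways. reduceByKey combines
// values map-side before the shuffle; groupByKey ships every pair.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val viaReduce = pairs.reduceByKey(_ + _)             // combines before shuffling
val viaGroup  = pairs.groupByKey().mapValues(_.sum)  // shuffles every pair first
```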


1 Answer

No, the order is not preserved. Example in spark-shell:

scala> sc.parallelize(Seq(0->1, 0->2), 2).groupByKey.collect
res0: Array[(Int, Iterable[Int])] = Array((0,ArrayBuffer(2, 1)))

The order is timing dependent, so it can vary between runs. (I got the opposite order on my next run.)

What is happening here? groupByKey works by repartitioning the RDD with a HashPartitioner, so that all values for a key end up in the same partition. Then it performs the aggregation locally on each partition.

The repartitioning is also called a "shuffle", because the lines of the RDD are redistributed between nodes. The shuffle files are pulled from the other nodes in parallel. The new partition is built from these pieces in the order that they arrive. The data from the slowest source will be at the end of the new partition, and at the end of the list in groupByKey.

(Data pulled from the worker itself is of course fastest. Since there is no network transfer involved here, this data is pulled synchronously, and thus arrives in order. (It seems to, at least.) So to replicate my experiment you need at least 2 Spark workers.)
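If a deterministic order is needed despite this, one option (a sketch, not part of the original answer) is to sort each group's values after grouping, assuming the values themselves define the desired order:

```scala
// Sketch: impose a deterministic order after the shuffle by sorting
// each group's values (assumes an Ordering exists for the value type).
val deterministic = sc.parallelize(Seq(0 -> 1, 0 -> 2), 2)
  .groupByKey()
  .mapValues(_.toSeq.sorted)
```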

Source: http://apache-spark-user-list.1001560.n3.nabble.com/Is-shuffle-quot-stable-quot-td7628.html

Answered Sep 18 '22 by Daniel Darabos