In Spark, the groupByKey function transforms a (K,V)
pair RDD into a (K,Iterable<V>)
pair RDD.
Yet, is this function stable? i.e is the order in the iterable preserved from the original order?
For example, if I originally read a file of the form:
K1;V11
K2;V21
K1;V12
May my iterable for K1
be like (V12, V11)
(thus not preserving the original order) or can it only be (V11, V12)
(thus preserving the original order)?
It definitely does reordering.
The GroupByKey function in apache spark is defined as the frequently used transformation operation that shuffles the data. The GroupByKey function receives key-value pairs or (K, V) as its input and group the values based on the key, and finally, it generates a dataset of (K, Iterable) pairs as its output.
groupBy after orderBy doesn't maintain order, as others have pointed out. What you want to do is use a Window function, partitioned on id and ordered by hours.
reduceByKey will aggregate y key before shuffling, and groupByKey will shuffle all the value key pairs as the diagrams show. On large size data the difference is obvious.
No, the order is not preserved. Example in spark-shell
:
scala> sc.parallelize(Seq(0->1, 0->2), 2).groupByKey.collect
res0: Array[(Int, Iterable[Int])] = Array((0,ArrayBuffer(2, 1)))
The order is timing dependent, so it can vary between runs. (I got the opposite order on my next run.)
What is happening here? groupByKey
works by repartitioning the RDD with a HashPartitioner
, so that all values for a key end in up in the same partition. Then it performs the aggregation locally on each partition.
The repartitioning is also called a "shuffle", because the lines of the RDD are redistributed between nodes. The shuffle files are pulled from the other nodes in parallel. The new partition is built from these pieces in the order that they arrive. The data from the slowest source will be at the end of the new partition, and at the end of the list in groupByKey
.
(Data pulled from the worker itself is of course fastest. Since there is no network transfer involved here, this data is pulled synchronously, and thus arrives in order. (It seems to, at least.) So to replicate my experiment you need at least 2 Spark workers.)
Source: http://apache-spark-user-list.1001560.n3.nabble.com/Is-shuffle-quot-stable-quot-td7628.html
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With