I have a Pair RDD (K, V) where the key contains a time and an ID. I would like to get a Pair RDD of the form (K, Iterable<V>) where the keys are grouped by ID and the iterable is ordered by time.

I'm currently using sortByKey().groupByKey(), and my tests seem to show that it works. However, I'm reading that this may not always be the case, as discussed in this question with diverging answers ( Does groupByKey in Spark preserve the original order? ).

Is that correct or not?

Thanks!
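For context, here is a minimal Scala sketch of the approach described in the question. The key layout (an (id, time) tuple), the sample records, and the re-keying step before groupByKey() are assumptions made for illustration; the original code is not shown in the question.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SortThenGroupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sort-then-group").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: key = (id, time), value = payload.
    val events: RDD[((String, Long), String)] = sc.parallelize(Seq(
      (("sensor-1", 3L), "c"),
      (("sensor-1", 1L), "a"),
      (("sensor-2", 2L), "b"),
      (("sensor-1", 2L), "b")
    ))

    // The pattern from the question: sort first, then group by id.
    // The sort runs before the groupByKey shuffle, so the per-group order that
    // appears in the output is not something the groupByKey API guarantees.
    val grouped: RDD[(String, Iterable[String])] =
      events.sortByKey()
            .map { case ((id, _), v) => (id, v) }
            .groupByKey()

    grouped.collect().foreach(println)
    spark.stop()
  }
}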
The answer from Matei, whom I consider authoritative on this topic, is quite clear:
The order is not guaranteed actually, only which keys end up in each partition. Reducers may fetch data from map tasks in an arbitrary order, depending on which ones are available first. If you’d like a specific order, you should sort each partition. Here you might be getting it because each partition only ends up having one element, and collect() does return the partitions in order.
In that context, a better option would be to apply the sorting to the resulting collections per key:
rdd.groupByKey().mapValues(_.sorted)
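Applied to the shape described in the question, that suggestion would look roughly like the sketch below. The (id, time) key layout and the sample data are assumptions; the idea is to carry the time into the value, group by id, and then sort each group locally by time, so the result does not depend on any ordering the shuffle may or may not preserve. Note that mapValues(_.sorted) as written above relies on an implicit Ordering for the value type, whereas this sketch sorts on the time field explicitly.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object GroupThenSortSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("group-then-sort").getOrCreate()
    val sc = spark.sparkContext

    // Same hypothetical shape as before: key = (id, time), value = payload.
    val events: RDD[((String, Long), String)] = sc.parallelize(Seq(
      (("sensor-1", 3L), "c"),
      (("sensor-1", 1L), "a"),
      (("sensor-2", 2L), "b"),
      (("sensor-1", 2L), "b")
    ))

    // Group by id first, then impose the order explicitly inside each group.
    val groupedSorted: RDD[(String, Seq[String])] =
      events.map { case ((id, time), v) => (id, (time, v)) }
            .groupByKey()
            .mapValues(_.toSeq.sortBy(_._1).map(_._2))

    groupedSorted.collect().foreach(println)  // e.g. (sensor-1, List(a, b, c)), (sensor-2, List(b))
    spark.stop()
  }
}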