Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark sort by key and then group by to get ordered iterable?

I have a Pair RDD (K, V) with the key containing a time and an ID. I would like to get a Pair RDD of the form (K, Iterable<V>) where the keys are groupped by id and the iterable is ordered by time.

I'm currently using sortByKey().groupByKey() and my tests seem to prove it works, however I'm reading that it may not always be the case, as discussed in this question with diverging answers ( Does groupByKey in Spark preserve the original order? ).

Is it correct or not?

Thanks!

like image 815
Ben Avatar asked Apr 22 '15 08:04

Ben


People also ask

What is the difference between sort() and orderby() in spark DataFrames?

Both sort () and orderBy () functions can be used to sort Spark DataFrames on at least one column and any desired order, namely ascending or descending. sort () is more efficient compared to orderBy () because the data is sorted on each partition individually and this is why the order in the output data is not guaranteed.

What is the use of sortbykey in spark?

Spark sortByKey () transformation is an RDD operation that is used to sort the values of the key by ascending or descending order. sortByKey () function operates on pair RDD (key/value pair) and it is available in org.apache.spark.rdd.OrderedRDDFunctions.

How to sort an object by its index value in spark?

OrderBy () function i s used to sort an object by its index value. Return type: Returns a new DataFrame sorted by the specified columns. Dataframe Creation: Create a new SparkSession object named spark then create a data frame with the custom data. Example 1: Sorting the data frame by a single column

How do I sort a Dataframe in Apache Spark?

Apache Spark In Spark, you can use either sort () or orderBy () function of DataFrame/Dataset to sort by ascending or descending order based on single or multiple columns, you can also do sorting using Spark SQL sorting functions, In this article, I will explain all these different ways using Scala examples. Using sort () function


1 Answers

The answer from Matei, who I consider authoritative on this topic, is quite clear:

The order is not guaranteed actually, only which keys end up in each partition. Reducers may fetch data from map tasks in an arbitrary order, depending on which ones are available first. If you’d like a specific order, you should sort each partition. Here you might be getting it because each partition only ends up having one element, and collect() does return the partitions in order.

In that context, a better option would be to apply the sorting to the resulting collections per key:

rdd.groupByKey().mapValues(_.sorted)
like image 186
maasg Avatar answered Sep 28 '22 11:09

maasg