Spark sort by key and then group by to get ordered iterable?

Tags:

I have a Pair RDD (K, V) with the key containing a time and an ID. I would like to get a Pair RDD of the form (K, Iterable<V>) where the keys are groupped by id and the iterable is ordered by time.

I'm currently using sortByKey().groupByKey() and my tests seem to prove it works, however I'm reading that it may not always be the case, as discussed in this question with diverging answers ( Does groupByKey in Spark preserve the original order? ).

Is it correct or not?

Thanks!

815

asked Apr 22 '15 08:04

Ben

1 Answers

The answer from Matei, who I consider authoritative on this topic, is quite clear:

The order is not guaranteed actually, only which keys end up in each partition. Reducers may fetch data from map tasks in an arbitrary order, depending on which ones are available first. If you’d like a specific order, you should sort each partition. Here you might be getting it because each partition only ends up having one element, and collect() does return the partitions in order.

In that context, a better option would be to apply the sorting to the resulting collections per key:

rdd.groupByKey().mapValues(_.sorted)

186

answered Sep 28 '22 11:09

maasg

Related questions
                            
                                Comparing Enum values for sorting
                            
                                sorting when name includes letters and numeric digits
                            
                                Swing JTable sort by Date
                            
                                Java, Compare 3 integers, arrange largest, median and smallest
                            
                                What algorithm does table.sort use?
                            
                                How to sort ABAddressBook contact through First Name
                            
                                Using the bash sort command within variable-length filenames
                            
                                How to use EMBER.SORTABLEMIXIN?
                            
                                C++ Sorting objects based on two data members
                            
                                What is the most elegant way to get the value in an array, minimizing a certain class attribute?
                            
                                What sorting algorithm does visual c++ use in std::sort
                            
                                Sort array of arrays by a column with custom sorting priority (not alphabetically)
                            
                                Arrays sort method behavior
                            
                                VB.NET finding text between two words in a string [closed]
                            
                                TreeMap high-low key Integer sort
                            
                                jquery dataTable reset sorting
                            
                                Jquery incorrectly sort <li> by id
                            
                                Sort mongo results by string enum value
                            
                                Rank within columns of 2d array
                            
                                Find unsorted indices after using numpy.searchsorted

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Spark sort by key and then group by to get ordered iterable?

Tags:

sorting

apache-spark

Ben

People also ask

1 Answers

maasg

Recent Activity

Donate For Us