If I have an RDD of 5-element tuples, e.g., RDD[(Double, String, Int, Double, Double)],
how can I sort this RDD efficiently by the fifth element?
I tried mapping this RDD into key-value pairs and using sortByKey, but sortByKey seems quite slow; it is even slower than collecting the RDD and sorting the collected array with sortWith. Why is that?
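For concreteness, here is roughly what the two approaches I am comparing look like (a simplified sketch; myRdd stands in for the actual 5-tuple RDD):
// Approach 1: map into (key, value) pairs keyed on the 5th element, then sortByKey
val viaSortByKey = myRdd.map(t => (t._5, t)).sortByKey().values
// Approach 2: collect to the driver and sort the resulting array locally
val viaCollect = myRdd.collect().sortWith(_._5 < _._5)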
Thank you very much.
Spark RDD sortByKey() syntax: ascending specifies the sort order; it is true by default, meaning ascending order, and false gives descending order. numPartitions specifies the number of partitions to create for the result of sortByKey().
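As a sketch of those two parameters (assuming a SparkContext sc and a made-up pair RDD rddPairs):
val rddPairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
rddPairs.sortByKey()        // ascending order (the default)
rddPairs.sortByKey(false)   // descending order
rddPairs.sortByKey(true, 4) // ascending order, result split into 4 partitions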
Method 1: Using sortBy(). sortBy() is used to sort the data by value efficiently in PySpark. It is a method available on an RDD and takes a lambda expression that selects the field to sort on.
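A minimal Scala sketch of the same idea (the excerpt mentions PySpark, but sortBy works the same way on a Scala RDD; pairs is a made-up example RDD):
val pairs = sc.parallelize(Seq(("a", 3), ("b", 1), ("c", 2)))
pairs.sortBy(_._2).collect()                    // sorted by value, ascending: (b,1), (c,2), (a,3)
pairs.sortBy(_._2, ascending = false).collect() // sorted by value, descending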
The action count() returns the number of elements in an RDD. For example, if an RDD named rdd holds the values {1, 2, 2, 3, 4, 5, 5, 6}, rdd.count() returns 8.
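For instance (a trivial sketch):
val rdd = sc.parallelize(Seq(1, 2, 2, 3, 4, 5, 5, 6))
rdd.count() // returns 8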
sortWithinPartitions() is more efficient than orderBy() because the data is sorted on each partition individually, with no shuffle, which is also why the overall order of the output is not guaranteed. orderBy() (an alias of sort()) instead performs a full shuffle to produce a single, globally ordered result.
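A sketch of the difference on a DataFrame (assuming a SparkSession named spark; the column names are made up):
import org.apache.spark.sql.functions.col
val df = spark.range(0, 100).withColumn("v", col("id") % 7)
val locallySorted  = df.sortWithinPartitions(col("v")) // each partition sorted on its own, no global order
val globallySorted = df.orderBy(col("v").desc)         // shuffle-based sort, globally ordered output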
You can do this with sortBy, acting directly on the RDD:
myRdd.sortBy(_._5) // Sort by 5th field of each 5-tuple
There are extra optional parameters to define the sort order (ascending) and the number of partitions (numPartitions).
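For the 5-tuple RDD above, they might be used like this (a sketch):
myRdd.sortBy(_._5, ascending = false, numPartitions = 8) // descending by 5th field, 8 output partitions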
If you want to sort in descending order and the element you are sorting on is numeric (e.g. Int), you can negate it with a "-" sign to sort the RDD in descending order.
For example:
Given an RDD of (String, Int) tuples, to sort it by its 2nd element in descending order:
rdd.sortBy(x => -x._2).collect().foreach(println)
Given an RDD of (String, String) tuples, where the "-" trick does not apply, pass ascending = false to sort by the 2nd element in descending order:
rdd.sortBy(x => x._2, false).collect().foreach(println)
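Putting both variants together in a self-contained sketch (assuming a SparkContext sc; the data is made up):
val wordCounts = sc.parallelize(Seq(("spark", 10), ("rdd", 3), ("sort", 7)))
wordCounts.sortBy(x => -x._2).collect().foreach(println)              // (spark,10), (sort,7), (rdd,3)
wordCounts.sortBy(_._2, ascending = false).collect().foreach(println) // same order, works for any Ordering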