
How to sort an RDD of tuples with 5 elements in Spark Scala?

I have an RDD of 5-element tuples, e.g. RDD[(Double, String, Int, Double, Double)].

How can I sort this RDD efficiently using the fifth element?

I tried mapping this RDD into key-value pairs and using sortByKey, but sortByKey seems quite slow: it is slower than collecting the RDD and calling sortWith on the resulting array. Why is that?

Thank you very much.

Carter asked Oct 13 '15 07:10



2 Answers

You can do this with sortBy acting directly on the RDD:

myRdd.sortBy(_._5) // Sort by 5th field of each 5-tuple

sortBy also takes two optional parameters: ascending (default true) to set the sort order, and numPartitions to set the number of partitions in the result.
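As a rough sketch, the call above with both optional parameters spelled out might look like the following. It assumes a local SparkContext and made-up sample rows; the object name SortByFifth, the helper sortByFifth, and the Row alias are illustrative and not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortByFifth {
  // Hypothetical alias for the 5-tuple shape described in the question.
  type Row = (Double, String, Int, Double, Double)

  // Sort by the fifth field. `ascending` and `numPartitions` are the
  // optional parameters of RDD.sortBy; here the partition count is kept as-is.
  def sortByFifth(rdd: RDD[Row], ascending: Boolean = true): RDD[Row] =
    rdd.sortBy(_._5, ascending = ascending, numPartitions = rdd.getNumPartitions)

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("sort-by-fifth").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data with distinct fifth fields.
      val myRdd = sc.parallelize(Seq(
        (1.0, "a", 1, 0.5, 3.0),
        (2.0, "b", 2, 0.1, 1.0),
        (3.0, "c", 3, 0.9, 2.0)
      ))
      // Smallest fifth field first.
      sortByFifth(myRdd).collect().foreach(println)
      // Largest fifth field first.
      sortByFifth(myRdd, ascending = false).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Calling collect() here is only for printing the small sample; on a large RDD the sorted result would normally be kept distributed.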

Shadowlands answered Oct 21 '22 04:10


If you want to sort in descending order and the sort key is an Int, you can negate the key with "-".

For example:

I have an RDD of (String, Int) tuples. To sort it by the second element in descending order:

rdd.sortBy(x => -x._2).collect().foreach(println)

I have an RDD of (String, String) tuples, where the key cannot be negated. To sort it by the second element in descending order, pass false for the ascending parameter:

rdd.sortBy(x => x._2, ascending = false).collect().foreach(println)
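
The two approaches above can be compared side by side on an Int key. This is a minimal sketch, assuming a local SparkContext and invented sample data; the object name DescendingSort and its helper names are hypothetical. One caveat with negation: -Int.MinValue overflows back to Int.MinValue, so the ascending = false form is the safer choice when that value can occur.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DescendingSort {
  // Descending by the Int field via negation (the trick from the answer).
  def byNegation(rdd: RDD[(String, Int)]): RDD[(String, Int)] =
    rdd.sortBy(x => -x._2)

  // Descending via the ascending flag; no arithmetic on the key is needed.
  def byFlag(rdd: RDD[(String, Int)]): RDD[(String, Int)] =
    rdd.sortBy(_._2, ascending = false)

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("descending-sort").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data with distinct keys, so the order is unambiguous.
      val rdd = sc.parallelize(Seq(("x", 2), ("y", 7), ("z", 5)))
      byNegation(rdd).collect().foreach(println)
      byFlag(rdd).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

For keys in the normal Int range, both helpers give the same ordering.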

Sivakumar answered Oct 21 '22 03:10
