
Spark - Difference between sortBy and sortByKey

Tags: apache-spark

What is the difference between sortBy and sortByKey functions in Spark?

I am performing the transformation below using both sortBy and sortByKey. Both give the same results, so what is the difference between them?

val reducedSfpd = sfpd.map(x => (x(col_2),1)).reduceByKey((x,y) => x+y)

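// Variant 1: sort by the count directly
val top3DistSortBy = reducedSfpd.sortBy(_._2, false).collect().take(3)
// Variant 2: swap to (count, key) pairs, then sort by key
val top3DistSortByKey = reducedSfpd.map(x => x.swap).sortByKey(false).take(3)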

Is there any performance-related difference between sortBy and sortByKey?

In fact when I am using sortBy I am saving one transformation of swapping the 'Key - Values' by applying map function. So why use sortByKey at all?

asked Feb 01 '16 16:02 by AJm


1 Answer

In fact when I am using sortBy I am saving one transformation of swapping the 'Key - Values' by applying map function.

You don't. In practice you add an extra transformation and slightly increase network traffic along the way. sortBy maps the input RDD to (f(x), x) pairs, applies sortByKey, and finally takes the values. I doubt it will noticeably affect performance, but it is certainly something to remember.
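To make that expansion concrete, here is a minimal sketch comparing sortBy with the keyBy / sortByKey / values chain described above. The sample data, the object name SortByExpansion and the local[2] master are assumptions made up for illustration; sortBy, keyBy, sortByKey and values are the real RDD methods.

```scala
import org.apache.spark.sql.SparkSession

object SortByExpansion {
  // Returns (result of sortBy, result of the hand-written expansion).
  def run(): (Array[(String, Int)], Array[(String, Int)]) = {
    val spark = SparkSession.builder()
      .master("local[2]")            // assumption: local test run
      .appName("sortBy-expansion")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      // Made-up (district, count) pairs standing in for reducedSfpd.
      val counts = sc.parallelize(Seq(
        ("SOUTHERN", 5), ("MISSION", 9), ("NORTHERN", 7), ("RICHMOND", 2)))

      // What the question does.
      val viaSortBy = counts.sortBy(_._2, ascending = false).collect()

      // The same steps spelled out by hand.
      val expanded = counts
        .keyBy(_._2)                  // build (f(x), x) pairs
        .sortByKey(ascending = false) // sort on the extracted key
        .values                       // drop the key again
        .collect()

      (viaSortBy, expanded)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val (viaSortBy, expanded) = run()
    println(viaSortBy.mkString(", "))
    println(expanded.mkString(", "))
  }
}
```

The counts are distinct on purpose, so both variants produce one well-defined order and can be compared element by element.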

While sortBy makes your intent a little more obvious, sortByKey outputs a partitioned RDD, which can be useful for downstream processing.
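One way to see the partitioning difference is to inspect the partitioner field of each result. This is only a sketch with made-up data and a hypothetical object name (PartitionerSketch); as far as I know, sortByKey keeps the partitioner it used for the shuffle, while the final values step inside sortBy discards it.

```scala
import org.apache.spark.sql.SparkSession

object PartitionerSketch {
  // Returns (sortByKey result has a partitioner, sortBy result has a partitioner).
  def run(): (Boolean, Boolean) = {
    val spark = SparkSession.builder()
      .master("local[2]")            // assumption: local test run
      .appName("partitioner-sketch")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      // Made-up (district, count) pairs standing in for reducedSfpd.
      val counts = sc.parallelize(Seq(
        ("SOUTHERN", 5), ("MISSION", 9), ("NORTHERN", 7), ("RICHMOND", 2)))

      val byKey    = counts.map(_.swap).sortByKey(ascending = false)
      val bySortBy = counts.sortBy(_._2, ascending = false)

      // partitioner is an Option[Partitioner] on every RDD.
      (byKey.partitioner.isDefined, bySortBy.partitioner.isDefined)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val (keyHasPartitioner, sortByHasPartitioner) = run()
    println(s"sortByKey keeps a partitioner: $keyHasPartitioner")
    println(s"sortBy keeps a partitioner:    $sortByHasPartitioner")
  }
}
```

A downstream operation that can reuse an existing partitioner (a join on the same key, for example) may avoid another shuffle when the partitioner is still present.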

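On a side note, I would use neither sortBy nor sortByKey to extract the top elements. Instead it is better to use top

reducedSfpd.top(3)(Ordering.by[(K, Int), Int](_._2))

or takeOrdered

reducedSfpd.takeOrdered(3)(Ordering.by[(K, Int), Int](-_._2))

with a specific ordering, where K is the type of the key. Note that top returns the largest elements under the given ordering while takeOrdered returns the smallest, which is why the count is negated in the takeOrdered version.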
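For reference, a small self-contained sketch of both calls on made-up data, assuming String keys and a local[2] master (the object name TopSketch is hypothetical too). Keep in mind that top picks the largest elements under the ordering it is given and takeOrdered the smallest, so the two orderings below are mirror images of each other.

```scala
import org.apache.spark.sql.SparkSession

object TopSketch {
  // Returns the three highest counts, once via top and once via takeOrdered.
  def run(): (Array[(String, Int)], Array[(String, Int)]) = {
    val spark = SparkSession.builder()
      .master("local[2]")            // assumption: local test run
      .appName("top-sketch")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      // Made-up (district, count) pairs standing in for reducedSfpd.
      val counts = sc.parallelize(Seq(
        ("SOUTHERN", 5), ("MISSION", 9), ("NORTHERN", 7), ("RICHMOND", 2)))

      // Largest three by count.
      val viaTop = counts.top(3)(Ordering.by[(String, Int), Int](_._2))

      // Smallest three by negated count, i.e. the same three elements.
      val viaTakeOrdered =
        counts.takeOrdered(3)(Ordering.by[(String, Int), Int](x => -x._2))

      (viaTop, viaTakeOrdered)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val (viaTop, viaTakeOrdered) = run()
    println(viaTop.mkString(", "))
    println(viaTakeOrdered.mkString(", "))
  }
}
```

Both calls only bring the requested number of elements per partition back to the driver, so neither needs a full sort of the RDD.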

answered Sep 19 '22 20:09 by zero323