Spark - Difference between sortBy and sortByKey

Question

What is the difference between sortBy and sortByKey functions in Spark?

I am performing the transformation below using both sortBy and sortByKey. Both give the same results, so what is the difference between them?

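val reducedSfpd = sfpd.map(x => (x(col_2), 1)).reduceByKey((x, y) => x + y)

// Option 1: sort the (key, count) pairs by count, descending, with sortBy
val top3Dist = reducedSfpd.sortBy(_._2, false).collect().take(3)
// Option 2: swap to (count, key) pairs and sort descending with sortByKey
val top3Dist = reducedSfpd.map(x => x.swap).sortByKey(false).take(3)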

Is there any performance-related difference between sortBy and sortByKey?

In fact, when I use sortBy I save one transformation: the map that swaps the keys and values. So why use sortByKey at all?

zero323 · Accepted Answer

"In fact, when I use sortBy I save one transformation: the map that swaps the keys and values."

You don't. In practice, you add a transformation and slightly increase network traffic along the way: sortBy maps the input RDD to (f(x), x) pairs, applies sortByKey, and finally takes the values. I doubt it will affect performance noticeably, but it is worth remembering.
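
The decomposition described above can be sketched as follows. This is a minimal illustration, not Spark's actual source: the object name, the local[2] master, and the sample (district, count) pairs standing in for reducedSfpd are all assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortByDecomposition {
  // Sort by count, descending, with sortBy (as in the question).
  def viaSortBy(counts: RDD[(String, Int)]): Array[(String, Int)] =
    counts.sortBy(_._2, ascending = false).collect()

  // The same steps spelled out: key each record by f(x),
  // sort by that key, then drop the key again.
  def viaSortByKey(counts: RDD[(String, Int)]): Array[(String, Int)] =
    counts.keyBy(_._2).sortByKey(ascending = false).values.collect()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sortBy-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical stand-in for reducedSfpd; counts are distinct so the order is unambiguous.
      val counts = sc.parallelize(Seq(("A", 5), ("B", 9), ("C", 2), ("D", 7)))
      println(viaSortBy(counts).mkString(", "))    // (B,9), (D,7), (A,5), (C,2)
      println(viaSortByKey(counts).mkString(", ")) // (B,9), (D,7), (A,5), (C,2)
    } finally sc.stop()
  }
}
```

Both paths shuffle (f(x), x) pairs, which is why sortBy does not save any work compared with an explicit swap followed by sortByKey.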

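While sortBy makes your intentions a little more obvious, sortByKey outputs a partitioned RDD (its result keeps the range partitioner used for the sort, whereas sortBy discards it when it extracts the values), which can be useful for downstream processing.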
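
One way to see the partitioning difference is to inspect the partitioner of each result. This is a small sketch under assumed names and sample data (the object name, the local master, and the pairs are hypothetical); it only reads the public partitioner field of each RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionerCheck {
  // True if the sortByKey result still carries a partitioner.
  def sortByKeyKeepsPartitioner(counts: RDD[(String, Int)]): Boolean =
    counts.map(_.swap).sortByKey(ascending = false).partitioner.isDefined

  // True if the sortBy result still carries a partitioner.
  def sortByKeepsPartitioner(counts: RDD[(String, Int)]): Boolean =
    counts.sortBy(_._2, ascending = false).partitioner.isDefined

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partitioner-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical stand-in for reducedSfpd.
      val counts = sc.parallelize(Seq(("A", 5), ("B", 9), ("C", 2), ("D", 7)))
      println(sortByKeyKeepsPartitioner(counts)) // true
      println(sortByKeepsPartitioner(counts))    // false
    } finally sc.stop()
  }
}
```

A retained partitioner matters only if later key-based operations can reuse it; for a plain collect or take it makes no difference.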

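On a side note, I would use neither sortBy nor sortByKey to extract the top elements, since both sort the entire RDD. Instead, it is better to use top

reducedSfpd.top(3)(Ordering.by[(K, Int), Int](_._2))

or takeOrdered

reducedSfpd.takeOrdered(3)(Ordering.by[(K, Int), Int](-_._2))

with a specific ordering, where K is the type of the key. top returns the largest elements under the given ordering and takeOrdered the smallest, hence the negated count in the second call.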
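
As a runnable sketch of extracting the three largest counts without a full sort, the example below uses top and takeOrdered on a small sample RDD. The object name, the local master, and the sample data are assumptions for illustration; String is used as the key type in place of K.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TopThree {
  // top returns the largest elements under the given ordering, in descending order.
  def viaTop(counts: RDD[(String, Int)]): Array[(String, Int)] =
    counts.top(3)(Ordering.by[(String, Int), Int](_._2))

  // takeOrdered returns the smallest elements, so order by the negated count.
  def viaTakeOrdered(counts: RDD[(String, Int)]): Array[(String, Int)] =
    counts.takeOrdered(3)(Ordering.by[(String, Int), Int](x => -x._2))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("top-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical stand-in for reducedSfpd; counts are distinct so the order is unambiguous.
      val counts = sc.parallelize(Seq(("A", 5), ("B", 9), ("C", 2), ("D", 7)))
      println(viaTop(counts).mkString(", "))         // (B,9), (D,7), (A,5)
      println(viaTakeOrdered(counts).mkString(", ")) // (B,9), (D,7), (A,5)
    } finally sc.stop()
  }
}
```

Both calls return a local Array on the driver, so no collect is needed afterwards.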

Tags:

apache-spark

AJm

1 Answers

zero323
