What is the difference between the sortBy and sortByKey functions in Spark?
I am performing the transformation below, in which I use both sortBy and sortByKey. Both give the same result, so what is the difference between them?
val reducedSfpd = sfpd.map(x => (x(col_2),1)).reduceByKey((x,y) => x+y)
val top3Dist = reducedSfpd.sortBy(_._2,false).collect().take(3)
val top3Dist = reducedSfpd.map(x => x.swap).sortByKey(false).take(3)
Is there any performance-related difference between sortBy and sortByKey? In fact, when I use sortBy I save one transformation, namely the map that swaps keys and values. So why use sortByKey?
In fact when I am using sortBy I am saving one transformation of swapping the 'Key - Values' by applying map function.
You don't. In practice you add an additional transformation and slightly increase network traffic along the way. sortBy maps the input RDD to (f(x), x) pairs, applies sortByKey, and finally takes the values. I doubt it will impact performance, but it is certainly something to remember.
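To make that decomposition concrete, here is a minimal sketch that spells out the three steps by hand and compares them with sortBy. It assumes spark-core is on the classpath and uses a local master; the sample pairs and the object name are made up for illustration and only stand in for reducedSfpd.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SortByEquivalence {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("sortBy-sketch"))
    try {
      // Hypothetical (district, count) pairs standing in for reducedSfpd.
      val counts = sc.parallelize(
        Seq(("MISSION", 5), ("SOUTHERN", 9), ("BAYVIEW", 2), ("CENTRAL", 7)))

      // The one-liner from the question.
      val viaSortBy = counts.sortBy(_._2, ascending = false).collect()

      // The same thing written out: build (f(x), x) pairs, sort by key, drop the key.
      val byHand = counts
        .keyBy(_._2)                   // RDD[(Int, (String, Int))]
        .sortByKey(ascending = false)
        .values                        // back to RDD[(String, Int)]
        .collect()

      // Counts are distinct here, so both orderings are fully determined.
      assert(viaSortBy.sameElements(byHand))
    } finally {
      sc.stop()
    }
  }
}
```

Note that each record is shuffled together with its extracted key, which is where the slightly higher network traffic comes from.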
While using sortBy makes your intentions a little more obvious, sortByKey outputs a partitioned RDD, which can be useful for downstream processing.
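One way to check the partitioning difference yourself is to print the partitioner of each result. This is a sketch under the same assumptions as before (spark-core available, local master, made-up sample data); what it prints depends on the Spark version you run it against.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionerCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("partitioner-check"))
    try {
      // Hypothetical (district, count) pairs standing in for reducedSfpd.
      val counts = sc.parallelize(
        Seq(("MISSION", 5), ("SOUTHERN", 9), ("BAYVIEW", 2), ("CENTRAL", 7)))

      val viaSortByKey = counts.map(_.swap).sortByKey(ascending = false)
      val viaSortBy = counts.sortBy(_._2, ascending = false)

      // RDD.partitioner is an Option[Partitioner]; print both for comparison.
      println(s"sortByKey partitioner: ${viaSortByKey.partitioner}")
      println(s"sortBy partitioner:    ${viaSortBy.partitioner}")
    } finally {
      sc.stop()
    }
  }
}
```

I would expect the sortByKey result to report a range partitioner and the sortBy result to report none, because sortBy drops the keys in a final map step, but treat that as something to confirm on your own cluster.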
On a side note, I would use neither sortBy nor sortByKey to extract the top elements. Instead it is better to use top
reducedSfpd.top(3)(Ordering.by[(K, Int), Int](_._2))
or takeOrdered
reducedSfpd.takeOrdered(3)(Ordering.by[(K, Int), Int](-_._2))
with a specific ordering, where K is the type of the key. Both return the three pairs with the largest counts without sorting the whole RDD.
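As a quick sanity check, the sketch below runs top, takeOrdered and the sort-then-take approach on the same data and confirms they agree on the three largest counts. The sample pairs, the object name and the String key type are assumptions for illustration; spark-core is assumed to be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TopThree {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("top-three"))
    try {
      // Hypothetical (district, count) pairs standing in for reducedSfpd.
      val counts = sc.parallelize(
        Seq(("MISSION", 5), ("SOUTHERN", 9), ("BAYVIEW", 2), ("CENTRAL", 7)))

      // top returns the largest elements under the given ordering.
      val viaTop = counts.top(3)(Ordering.by[(String, Int), Int](_._2))

      // takeOrdered returns the smallest elements, so negate the count.
      val viaTakeOrdered =
        counts.takeOrdered(3)(Ordering.by[(String, Int), Int](-_._2))

      // Full sort followed by take, as in the question.
      val viaSort = counts.sortBy(_._2, ascending = false).take(3)

      // Counts are distinct, so all three approaches must agree exactly.
      assert(viaTop.sameElements(viaTakeOrdered))
      assert(viaTop.sameElements(viaSort))
      viaTop.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

If several keys share the same count, the three approaches can break ties differently, so compare only the counts in that case.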