By default, the Hadoop MapReduce shuffle sorts the shuffle keys within each partition, but not across partitions (a total ordering is what makes keys sorted across partitions).
How can I achieve the same thing with a Spark RDD, that is, sort within each partition but not across partitions?
The sortByKey method performs a total ordering.
repartitionAndSortWithinPartitions sorts within each partition but not across partitions; unfortunately, it adds an extra repartition step.
Is there a direct way to sort within partitions but not across them?
You can use a Dataset and its sortWithinPartitions method:
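// Assumes an active SparkSession named spark and its SparkContext sc, as in spark-shell.
import spark.implicits._

sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
  .toDF("text")
  .sortWithinPartitions($"text")  // sorts each partition locally, without a shuffle
  .show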
+----+
|text|
+----+
|   d|
|   e|
|   f|
|   a|
|   b|
|   c|
+----+
In general, the shuffle is an important factor in sorting partitions, because the sort can reuse the shuffle structures and so avoid loading all of a partition's data into memory at once.
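If you want to stay on the RDD API and skip the shuffle entirely, one possible approach is to sort each partition yourself with mapPartitions. The sketch below is only an illustration of that trade-off, not a built-in Spark method: the object name SortWithinPartitionsRdd and the helper sortWithinPartitions are made up for this example, and the local[2] master is assumed just so it runs standalone. Note the cost mentioned above: toVector materializes a whole partition in memory, so this only fits partitions that are small enough to hold at once.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SortWithinPartitionsRdd {

  // Hypothetical helper (not part of Spark): sorts every partition locally by key.
  // No shuffle is triggered, but toVector loads the whole partition into memory.
  def sortWithinPartitions[K: Ordering, V](rdd: RDD[(K, V)]): RDD[(K, V)] =
    rdd.mapPartitions(
      iter => iter.toVector.sortBy(_._1).iterator,
      preservesPartitioning = true // keys do not move between partitions
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a local two-thread session, only for running the example.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("sort-within-partitions")
      .getOrCreate()
    val sc = spark.sparkContext

    // Same six keys and two partitions as in the Dataset example, as (key, value) pairs.
    val pairs = sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2).map(k => (k, 1))
    val sorted = sortWithinPartitions(pairs)

    // glom() turns each partition into an array, so the per-partition order is visible.
    sorted.glom().collect().foreach(p => println(p.map(_._1).mkString(", ")))

    spark.stop()
  }
}
```

With the same input as the Dataset example, each partition should come out sorted on its own (d, e, f and then a, b, c), with no ordering enforced between the two partitions. For partitions too large to hold in memory, sortWithinPartitions on a Dataset or repartitionAndSortWithinPartitions is likely the safer choice.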