How to sort within partitions (and avoid sorting across partitions) using the RDD API?

Question

By default, the Hadoop MapReduce shuffle sorts by the shuffle key within each partition, but not across partitions. (It is total ordering that makes keys sorted across partitions.)

How can I achieve the same thing with a Spark RDD, that is, sort within each partition but not across partitions?

RDD's sortByKey method performs a total ordering.
RDD's repartitionAndSortWithinPartitions sorts within each partition but not across partitions; unfortunately, it adds an extra repartition step.

Is there a direct way to sort within each partition but not across partitions?

user7849215 · Accepted Answer

You can use a Dataset and its sortWithinPartitions method:

import spark.implicits._

sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
  .toDF("text")
  .sortWithinPartitions($"text")
  .show

+----+
|text|
+----+
|   d|
|   e|
|   f|
|   a|
|   b|
|   c|
+----+

In general, the shuffle is an important factor in sorting partitions, because the sort reuses shuffle structures to sort without loading all the data into memory at once.
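If you need to stay in the RDD API, one option is to sort inside mapPartitions. The following is only a minimal sketch, and it assumes each partition is small enough to be materialized in executor memory (which is exactly the limitation the shuffle-based sort avoids). The names SortWithinPartitionsRdd and sortEachPartition are hypothetical, chosen for illustration; mapPartitions and glom are standard RDD methods.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.reflect.ClassTag

object SortWithinPartitionsRdd {

  // Hypothetical helper: sorts every partition independently, with no shuffle.
  // Assumption: a whole partition fits in memory, since toArray materializes it.
  def sortEachPartition[T: Ordering: ClassTag](rdd: RDD[T]): RDD[T] =
    rdd.mapPartitions(
      iter => iter.toArray.sorted.iterator,
      preservesPartitioning = true // elements never move between partitions
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("sort-within-partitions-rdd")
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)

    // glom() turns each partition into one array so the layout can be inspected.
    sortEachPartition(rdd)
      .glom()
      .collect()
      .foreach(partition => println(partition.mkString(", ")))

    spark.stop()
  }
}
```

With the same six-element input and two partitions as above, each partition comes back sorted on its own (d, e, f and a, b, c) and there is no ordering across partitions. For partitions that may not fit in memory, sortWithinPartitions on a Dataset or repartitionAndSortWithinPartitions is the safer choice.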

Tags:

apache-spark

Tom
