Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to sort within partitions (and avoid sort across the partitions) using RDD API?

Tags:

apache-spark

It is Hadoop MapReduce shuffle's default behavior to sort the shuffle key within partition, but not cross partitions(It is the total ordering that makes keys sorted cross the parttions)

I would ask how to achieve the same thing using Spark RDD(sort within Partition,but not sort cross the partitions)

  1. RDD's sortByKey method is doing total ordering
  2. RDD's repartitionAndSortWithinPartitions is doing sort within partition but not cross partitions, but unfortunately it adds an extra step to do repartition.

Is there a direct way to sort within partition but not cross partitions?

like image 672
Tom Avatar asked Apr 11 '17 07:04

Tom


1 Answers

You can use Dataset and sortWithinPartitions method:

import spark.implicits._

sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
  .toDF("text")
  .sortWithinPartitions($"text")
  .show

+----+
|text|
+----+
|   d|
|   e|
|   f|
|   a|
|   b|
|   c|
+----+

In general shuffle is an important factor in sorting partitions because it reuse shuffle structures to sort without loading all data into memory at once.

like image 197
user7849215 Avatar answered Oct 18 '22 09:10

user7849215