Number of dataframe partitions after sorting?

Tags: apache-spark, apache-spark-sql

How does Spark determine the number of partitions after an orderBy? I always thought that the resulting DataFrame has spark.sql.shuffle.partitions partitions, but this does not seem to be true:

val df = (1 to 10000).map(i => ("a",i)).toDF("n","i").repartition(10).cache

df.orderBy($"i").rdd.getNumPartitions // = 200 (=spark.sql.shuffle.partitions)
df.orderBy($"n").rdd.getNumPartitions // = 2

In both cases, the physical plan contains +- Exchange rangepartitioning(i/n ASC NULLS FIRST, 200), so how can the resulting number of partitions in the second case be 2?


asked Dec 14 '18 19:12

Raphael Roth

2 Answers

spark.sql.shuffle.partitions is used as an upper bound. The final number of partitions satisfies 1 <= partitions <= spark.sql.shuffle.partitions.

As you've mentioned, sorting in Spark goes through RangePartitioner. It tries to partition your dataset into a specified number (spark.sql.shuffle.partitions) of roughly equal ranges.
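The upper-bound behaviour can be checked with a small experiment. The following is a minimal sketch, assuming a Spark 3.x dependency on the classpath and a session created by the caller; the object and method names are made up for illustration. Adaptive query execution is switched off because it may otherwise coalesce shuffle partitions and hide the effect.

```scala
import org.apache.spark.sql.SparkSession

object OrderByPartitionCount {
  // Returns the partition counts after ordering by "i" (many distinct values)
  // and by "n" (a single distinct value), using the given shuffle-partition setting.
  def partitionCounts(spark: SparkSession, shufflePartitions: Int): (Int, Int) = {
    import spark.implicits._

    // Both settings are runtime SQL configs.
    spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions.toString)
    spark.conf.set("spark.sql.adaptive.enabled", "false") // avoid AQE coalescing

    val df = (1 to 10000).map(i => ("a", i)).toDF("n", "i")

    val byI = df.orderBy($"i").rdd.getNumPartitions
    val byN = df.orderBy($"n").rdd.getNumPartitions
    (byI, byN)
  }
}
```

Called as `OrderByPartitionCount.partitionCounts(spark, 8)`, neither count should exceed 8, and the count for "n" is expected to be much smaller than the one for "i".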

There's a guarantee that equal values will end up in the same partition after the partitioning. It's worth checking the class documentation of RangePartitioning (not part of the public API):

...

All row where the expressions in ordering evaluate to the same values will be in the same partition

If the number of distinct ordering values is less than the desired number of partitions, i.e. the number of possible ranges is less than spark.sql.shuffle.partitions, you'll end up with a smaller number of partitions. Also, here's a quote from the RangePartitioner Scaladoc:

The actual number of partitions created by the RangePartitioner might not be the same as the partitions parameter, in the case where the number of sampled records is less than the value of partitions.

Going back to your example, n is a constant ("a"), so its values cannot be split into multiple ranges. On the other hand, i has 10,000 distinct values and is partitioned into 200 (= spark.sql.shuffle.partitions) ranges, i.e. partitions.
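To make the effect of distinct values concrete, here is a simplified model of how range bounds could be chosen from a sorted sample. It is only a sketch, not Spark's code: it assumes every record is sampled with equal weight, that a bound is accepted only if it is strictly greater than the previous one, and that the partition count is the number of bounds plus one. The names RangeBoundsSketch, bounds and numPartitions are made up for illustration.

```scala
object RangeBoundsSketch {
  /** Picks at most `requested - 1` strictly increasing upper bounds from the keys. */
  def bounds[K](keys: Seq[K], requested: Int)(implicit ord: Ordering[K]): Vector[K] = {
    val sorted = keys.sorted
    // Assumption: never ask for more ranges than there are sampled records.
    val partitions = math.min(requested, sorted.size)
    if (partitions <= 1) Vector.empty
    else {
      val step = sorted.size.toDouble / partitions // records per range
      var target = step
      var seen = 0.0
      var result = Vector.empty[K]
      val it = sorted.iterator
      while (it.hasNext && result.size < partitions - 1) {
        val key = it.next()
        seen += 1.0
        // Equal keys must share a partition, so a repeated key never becomes a new bound.
        if (seen >= target && result.lastOption.forall(prev => ord.lt(prev, key))) {
          result = result :+ key
          target += step
        }
      }
      result
    }
  }

  /** Partition count under this model: one more than the number of bounds. */
  def numPartitions[K: Ordering](keys: Seq[K], requested: Int): Int =
    bounds(keys, requested).size + 1
}
```

Under these assumptions, `RangeBoundsSketch.numPartitions(Seq.fill(10000)("a"), 200)` gives 2 and `RangeBoundsSketch.numPartitions(1 to 10000, 200)` gives 200, which matches the numbers reported in the question.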

Note that this is only true for the DataFrame/Dataset API. When using the RDD method sortByKey, one can either specify the number of partitions explicitly, or Spark will use the current number of partitions.
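For the RDD case, a minimal sketch could look like the following. It assumes Spark is on the classpath and that the caller supplies a SparkContext; SortByKeySketch and partitionCounts are hypothetical names. Since sortByKey also range-partitions the keys, the requested number is best read as an upper bound here as well.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SortByKeySketch {
  // Returns (partitions with the default argument, partitions with an explicit request).
  def partitionCounts(sc: SparkContext, requested: Int): (Int, Int) = {
    // 10 input partitions, 10,000 distinct keys.
    val pairs: RDD[(Int, String)] =
      sc.parallelize((1 to 10000).map(i => (i, "a")), numSlices = 10)

    // Default: numPartitions falls back to the RDD's current partition count.
    val byDefault = pairs.sortByKey()
    // Explicit: ask for a specific number of range partitions.
    val explicit = pairs.sortByKey(ascending = true, numPartitions = requested)

    (byDefault.getNumPartitions, explicit.getNumPartitions)
  }
}
```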

Sergey Khudyakov

I ran various tests to look at this more empirically, in addition to reading up on range partitioning for sorting, which is the crux of the matter here. See How does range partitioner work in Spark?

I experimented with df.orderBy($"n") using a single distinct value for "n", as in the question, and using more than one distinct value, across various DataFrame sizes. The number of range partitions is determined by sampling the existing partitions and then computing a heuristically optimal number of ranges from that sample; the data in each range is subsequently sorted via mapPartitions. In most cases this calculation generates N+1 partitions, whereby partition N+1 is empty.

The fact that the extra partition is nearly always empty makes me think there is a calculation error somewhere in the code, in other words a small bug, imho.

I base this on the following simple test, which returns what I suspect the asker would consider the proper number of partitions:

val df_a1 = (1 to 1).map(i => ("a",i)).toDF("n","i").cache
val df_a2 = (1 to 1).map(i => ("b",i)).toDF("n","i").cache
val df_a3 = (1 to 1).map(i => ("c",i)).toDF("n","i").cache
val df_b = df_a1.union(df_a2)
val df_c = df_b.union(df_a3)

df_c.orderBy($"n")
 .rdd
 .mapPartitionsWithIndex{case (i,rows) => Iterator((i,rows.size))}
 .toDF("partition_number","number_of_records")
 .show(100,false)

returns:

+----------------+-----------------+
|partition_number|number_of_records|
+----------------+-----------------+
|0               |1                |
|1               |1                |
|2               |1                |
+----------------+-----------------+

This boundary case is rather simple. As soon as I use 1 to 2, or 1 to N, for any of the "n" values, an extra empty partition results:

+----------------+-----------------+
|partition_number|number_of_records|
+----------------+-----------------+
|0               |2                |
|1               |1                |
|2               |1                |
|3               |0                |
+----------------+-----------------+

The sorting requires all data for a given "n" or set of "n" to be in the same partition.
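One possible reading of the empty trailing partition is that a range partitioner with k upper bounds always exposes k+1 partitions, the last one being reserved for keys greater than every bound. The sketch below illustrates that reading only. The bounds passed in are assumed rather than computed by Spark, a key equal to a bound is assumed to stay in the partition below it, and the names are made up for illustration.

```scala
object RangeAssignmentSketch {
  // Index of the first bound that is >= key; keys above every bound go to the last partition.
  def partitionOf[K](key: K, bounds: Vector[K])(implicit ord: Ordering[K]): Int = {
    val idx = bounds.indexWhere(b => ord.lteq(key, b))
    if (idx >= 0) idx else bounds.size
  }

  // Number of records landing in each of the bounds.size + 1 partitions.
  def recordsPerPartition[K: Ordering](keys: Seq[K], bounds: Vector[K]): Vector[Int] = {
    val counts = Array.fill(bounds.size + 1)(0)
    keys.foreach(k => counts(partitionOf(k, bounds)) += 1)
    counts.toVector
  }
}
```

With the assumed bounds Vector("a", "b", "c"), `RangeAssignmentSketch.recordsPerPartition(Seq("a", "a", "b", "c"), Vector("a", "b", "c"))` yields Vector(2, 1, 1, 0), the same shape as the second table above. With bounds Vector("a", "b") and keys "a", "b", "c" it yields Vector(1, 1, 1), as in the first table.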

answered Oct 02 '22 15:10

thebluephantom
