It's Spark 2.2.0-SNAPSHOT.
Why is the number of partitions 200 after the groupBy transformation in the following example?
scala> spark.range(5).groupByKey(_ % 5).count.rdd.getNumPartitions
res0: Int = 200
What's so special about 200? Why not some other number, like 1024?
I've been pointed to Why does groupByKey operation have always 200 tasks?, which asks specifically about groupByKey, but this question is about the "mystery" behind picking 200 as the default, not about why there are 200 partitions by default.
This is controlled by spark.sql.shuffle.partitions, which defaults to 200.
In general, whenever you do a Spark SQL aggregation or join that shuffles data, this is the number of resulting partitions.
It is constant for your entire action (i.e. it is not possible to change it for one transformation and then again for another within the same action).
See http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options for more information.
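As a minimal sketch (assuming a spark-shell session, and the value 8 is just an arbitrary example), you can read the setting and change it on the session before the action runs, and the shuffle will then produce that many partitions:

scala> spark.conf.get("spark.sql.shuffle.partitions")
res0: String = 200

scala> spark.conf.set("spark.sql.shuffle.partitions", 8)

scala> spark.range(5).groupByKey(_ % 5).count.rdd.getNumPartitions
res1: Int = 8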