Why is the number of partitions after groupBy 200? Why is this 200 not some other number?

Tags:

apache-spark

It's Spark 2.2.0-SNAPSHOT.

Why is the number of partitions after groupBy transformation 200 in the following example?

scala> spark.range(5).groupByKey(_ % 5).count.rdd.getNumPartitions
res0: Int = 200

What's so special about 200? Why not some other number like 1024?

I've been pointed to Why does groupByKey operation have always 200 tasks?, which asks specifically about groupByKey, but this question is about the "mystery" behind picking 200 as the default, not about why there are 200 partitions by default.

Asked Dec 28 '16 by Jacek Laskowski

1 Answer

This is set by spark.sql.shuffle.partitions, which defaults to 200.

In general, whenever you do a Spark SQL aggregation or join that shuffles data, this is the number of resulting partitions.
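For example, a quick sketch in the spark-shell (assuming the same session as above; the exact res numbers will vary): lowering the setting changes the post-shuffle partition count accordingly.

scala> spark.conf.set("spark.sql.shuffle.partitions", 8)

scala> spark.range(5).groupByKey(_ % 5).count.rdd.getNumPartitions
res1: Int = 8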

It is constant for your entire action (i.e., you cannot change it for one transformation and then again for another within the same action).
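One way to see this (a sketch, not a definitive rule for all Spark versions): the value is read when each query is planned, so changing it between separate queries works fine; a single query simply uses one value for all of its shuffles.

scala> spark.conf.set("spark.sql.shuffle.partitions", 4)

scala> spark.range(10).groupByKey(_ % 2).count.rdd.getNumPartitions
res2: Int = 4

scala> spark.conf.set("spark.sql.shuffle.partitions", 16)

scala> spark.range(10).groupByKey(_ % 2).count.rdd.getNumPartitions
res3: Int = 16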

See http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options for more information.

Answered by Assaf Mendelson