 

Number reduce tasks Spark


What is the formula that Spark uses to calculate the number of reduce tasks?

I am running a couple of spark-sql queries and the number of reduce tasks is always 200. The number of map tasks for these queries is 154. I am on Spark 1.4.1.

Is this related to spark.shuffle.sort.bypassMergeThreshold, which defaults to 200?

asked Oct 23 '15 by Uli Bethke

People also ask

How does Spark decide the number of tasks?

The number of CPU cores available to an executor determines the number of tasks that can be executed in parallel for an application at any given time.
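For illustration, a rough sketch of how those knobs are typically set (the app name and the specific values are assumptions, not from the question; spark.executor.instances applies on YARN-style cluster managers):

import org.apache.spark.{SparkConf, SparkContext}

// Request 5 executors with 4 cores each: at most 5 * 4 = 20 tasks run in parallel.
val conf = new SparkConf()
  .setAppName("parallelism-sketch")        // illustrative name
  .set("spark.executor.instances", "5")    // number of executors
  .set("spark.executor.cores", "4")        // concurrent tasks per executor
val sc = new SparkContext(conf)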

How do you increase Spark parallelism?

One important way to increase the parallelism of Spark processing is to increase the number of executors on the cluster. However, knowing how the data should be distributed so that the cluster can process it efficiently is extremely important. The key to achieving this is partitioning in Spark, as in the sketch below.
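A hedged sketch of explicit partitioning (reusing the sc from the sketch above; the data and the partition count of 200 are made up for illustration):

// Build a pair RDD and control its partition count explicitly,
// so the shuffle stage gets more (or fewer) tasks.
val pairs = sc.parallelize(1 to 1000000).map(n => (n % 10, 1))
val spread = pairs.repartition(200)           // reshuffle into 200 partitions
val counts = pairs.reduceByKey(_ + _, 200)    // numPartitions can also be passed to wide transformations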


2 Answers

It's spark.sql.shuffle.partitions that you're after. According to the Spark SQL performance tuning guide:

| Property Name                 | Default | Meaning                                        |
+-------------------------------+---------+------------------------------------------------+
| spark.sql.shuffle.partitions  | 200     | Configures the number of partitions to use     |
|                               |         | when shuffling data for joins or aggregations. |

Another related option is spark.default.parallelism, which determines the 'default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user'; however, this seems to be ignored by Spark SQL and is only relevant when working on plain RDDs.
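A rough Spark 1.x-style sketch of that distinction (the data and the configured values are illustrative assumptions):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("shuffle-partitions-sketch")
  .set("spark.default.parallelism", "50")    // used by plain RDD shuffles
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// RDD path: this reduceByKey shuffle produces 50 partitions (spark.default.parallelism).
val rddCounts = sc.parallelize(1 to 1000).map(n => (n % 10, 1)).reduceByKey(_ + _)

// Spark SQL path: the aggregation shuffle uses spark.sql.shuffle.partitions instead (200 by default).
sqlContext.setConf("spark.sql.shuffle.partitions", "100")
val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"), (1, "c"))).toDF("id", "value")
val grouped = df.groupBy("id").count()    // this shuffle stage gets 100 tasks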

answered Sep 18 '22 by sgvd


Yes, @sgvd, that is the correct parameter. Here is how you reset it in Scala:

// Set number of shuffle partitions to 3
sqlContext.setConf("spark.sql.shuffle.partitions", "3")

// Verify the setting
sqlContext.getConf("spark.sql.shuffle.partitions")
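The question is on Spark 1.4, where SQLContext is the entry point; on Spark 2.x and later the same setting goes through the SparkSession configuration instead (a sketch, assuming a spark session is already in scope):

// Spark 2.x+ equivalent via SparkSession
spark.conf.set("spark.sql.shuffle.partitions", "3")
spark.conf.get("spark.sql.shuffle.partitions")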
answered Sep 17 '22 by pmhargis