
spark.default.parallelism for Parallelize RDD defaults to 2 for spark submit

Spark standalone cluster with one master and 2 worker nodes, 4 CPU cores on each worker. 8 cores in total across all workers.

When running the following via spark-submit (spark.default.parallelism is not set)

val myRDD = sc.parallelize(1 to 100000)
println("Partition size - " + myRDD.partitions.size)
val total = myRDD.reduce((x, y) => x + y)
println("Sum - " + total)

It returns a partition size of 2.

When running the same code from spark-shell connected to the standalone cluster, it returns the correct partition size of 8.

What could be the reason?

Thanks.

asked Feb 13 '16 by Sami

People also ask

What is the default value of Spark default parallelism?

For example, the default for spark.default.parallelism is only 2 x the number of virtual cores available, though parallelism can be higher for a large cluster. Spark on YARN can dynamically scale the number of executors used for a Spark application based on the workloads.

What is the difference between Spark SQL shuffle partitions and Spark default parallelism?

spark.sql.shuffle.partitions configures the number of partitions that are used when shuffling data for joins or aggregations. spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set explicitly by the user.
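
For example, a minimal sketch (assuming a Spark 2.x SparkSession; the values 8 and 200 are arbitrary examples) of which setting governs which API:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallelism-vs-shuffle-partitions")
  .config("spark.default.parallelism", "8")      // used by RDD APIs such as parallelize, join, reduceByKey
  .config("spark.sql.shuffle.partitions", "200") // used by DataFrame/SQL shuffles (joins, aggregations)
  .getOrCreate()

val sc = spark.sparkContext
println(sc.parallelize(1 to 100000).partitions.size)    // governed by spark.default.parallelism
println(spark.conf.get("spark.sql.shuffle.partitions")) // governed by spark.sql.shuffle.partitions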

How do I change the default partition in Spark?

The Spark SQL shuffle partition count is controlled by spark.sql.shuffle.partitions, which is set to 200 by default. You can change this default shuffle partition value using the conf method of the SparkSession object or using spark-submit command configurations.
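
As a sketch (assuming an existing SparkSession named spark; 100 is an arbitrary example value), either of these works:

// Change the shuffle partition count at runtime through the SparkSession conf
spark.conf.set("spark.sql.shuffle.partitions", "100")

// Or pass it at submit time, e.g.:
//   spark-submit --conf spark.sql.shuffle.partitions=100 ...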


1 Answer

spark.default.parallelism defaults to the total number of cores on all executor machines. The parallelize API has no parent RDD from which to determine the number of partitions, so it uses spark.default.parallelism.
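
Note that parallelize also takes an explicit slice count, which overrides spark.default.parallelism for that particular RDD (a minimal sketch; 8 is just an example value):

val myRDD = sc.parallelize(1 to 100000, 8)           // explicit number of partitions
println("Partition size - " + myRDD.partitions.size) // prints 8 regardless of the default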

When running spark-submit, you're probably running it locally rather than against the standalone master. Try submitting your application with the same startup configuration, in particular the same --master URL, that you use for spark-shell.
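
A quick way to check is to print the master URL and the default parallelism in both spark-shell and your submitted application (a diagnostic sketch; the master URL in the comment is a placeholder):

println("Master - " + sc.master)                          // e.g. spark://<master-host>:7077 vs local[*]
println("Default parallelism - " + sc.defaultParallelism)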

Pulled this from the documentation:

spark.default.parallelism

For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:

Local mode: number of cores on the local machine

Mesos fine grained mode: 8

Others: total number of cores on all executor nodes or 2, whichever is larger

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
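
If you want the submitted job to see all 8 cores regardless of how it is launched, one option is to set spark.default.parallelism explicitly when building the SparkConf (a minimal sketch; the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parallelism-example")     // placeholder app name
  .set("spark.default.parallelism", "8") // match the 8 cores in the standalone cluster

val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100000).partitions.size) // 8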

answered by Joe Widen