I have a Spark standalone cluster with one master and 2 worker nodes, 4 CPU cores on each worker, so 8 cores total across the workers.
When I run the following via spark-submit (spark.default.parallelism is not set):
val myRDD = sc.parallelize(1 to 100000)
println("Partition size - " + myRDD.partitions.size)
val total = myRDD.reduce((x, y) => x + y)
println("Sum - " + total)
it prints 2 for the partition size.
When I run the same code in spark-shell connected to the same standalone cluster, it prints the expected partition size of 8.
What can be the reason?
Thanks.
For example, on Spark on YARN the default for spark.default.parallelism is only 2 x the number of virtual cores available, though parallelism can be set higher for a large cluster. Spark on YARN can also dynamically scale the number of executors used by a Spark application based on the workload.
spark.sql.shuffle.partitions configures the number of partitions that are used when shuffling data for joins or aggregations, and it defaults to 200. spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set explicitly by the user. You can change the default shuffle partition value using the conf method of the SparkSession object or via spark-submit configuration.
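As a sketch of the two ways to override these defaults, assuming a Spark 2.x+ SparkSession (the app name and partition values here are purely illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Set both defaults at build time (values are illustrative).
val spark = SparkSession.builder()
  .appName("partition-demo")
  .config("spark.sql.shuffle.partitions", "64") // shuffle partitions for joins/aggregations; default 200
  .config("spark.default.parallelism", "8")     // default partition count for RDD transformations
  .getOrCreate()

// Or change the shuffle-partition setting at runtime via the conf method:
spark.conf.set("spark.sql.shuffle.partitions", "64")
```

Note that spark.default.parallelism must be set before the SparkContext starts; unlike spark.sql.shuffle.partitions, it cannot be changed at runtime.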
spark.default.parallelism defaults to the total number of cores on all executor machines. The parallelize API has no parent RDD to determine the number of partitions, so it uses spark.default.parallelism.
When running spark-submit, you're probably running it locally. Try submitting your spark-submit with the same startup configs as you use for the spark-shell.
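A minimal sketch of what that looks like, assuming a hypothetical master URL (spark://master-host:7077) and application jar/class, all of which are placeholders:

```shell
# Point spark-submit at the standalone master explicitly; without --master
# the app may run in local mode, where parallelism defaults differ.
spark-submit \
  --master spark://master-host:7077 \
  --class com.example.MyApp \
  my-app.jar

# spark-shell connected to the same cluster, for comparison:
spark-shell --master spark://master-host:7077
```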
Pulled this from the documentation:

spark.default.parallelism

For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:

- Local mode: number of cores on the local machine
- Mesos fine grained mode: 8
- Others: total number of cores on all executor nodes or 2, whichever is larger

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
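Since parallelize falls back to spark.default.parallelism, another way to make the question's snippet behave the same everywhere is to pass the partition count explicitly via the optional numSlices argument (a sketch assuming an existing SparkContext sc, as in spark-shell):

```scala
// Assumes `sc` is an existing SparkContext (as provided in spark-shell).
val myRDD = sc.parallelize(1 to 100000, 8) // explicit numSlices: 8 partitions regardless of defaults
println("Partition size - " + myRDD.partitions.size) // always 8
println("Sum - " + myRDD.reduce(_ + _))
```

This sidesteps the local-vs-cluster difference entirely, at the cost of hard-coding a partition count that may not suit every deployment.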