I have a Spark standalone cluster with one master and 2 worker nodes, 4 CPU cores on each worker, so 8 cores total across the workers.
When I run the following via spark-submit (spark.default.parallelism is not set):
val myRDD = sc.parallelize(1 to 100000)
println("Partition size - " + myRDD.partitions.size)
val total = myRDD.reduce((x, y) => x + y)
println("Sum - " + total)
it prints 2 for the partition size.
When I run the same code in spark-shell connected to the same standalone cluster, it prints the expected partition size of 8.
What can be the reason?
Thanks.
For example, on Spark on YARN the default for spark.default.parallelism is only 2 x the number of virtual cores available, though parallelism can be set higher for a large cluster. Spark on YARN can also dynamically scale the number of executors used by a Spark application based on the workload.
spark.sql.shuffle.partitions configures the number of partitions that are used when shuffling data for joins or aggregations, and it defaults to 200. spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set explicitly by the user. You can change the default shuffle partition value using the conf method of the SparkSession object or via spark-submit configuration.
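As a sketch of the two ways to override these defaults, assuming a Spark 2.x+ SparkSession (the app name and partition values here are purely illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Set both defaults at build time (values are illustrative).
val spark = SparkSession.builder()
  .appName("partition-demo")
  .config("spark.sql.shuffle.partitions", "64") // shuffle partitions for joins/aggregations; default 200
  .config("spark.default.parallelism", "8")     // default partition count for RDD transformations
  .getOrCreate()

// Or change the shuffle-partition setting at runtime via the conf method:
spark.conf.set("spark.sql.shuffle.partitions", "64")
```

Note that spark.default.parallelism must be set before the SparkContext starts; unlike spark.sql.shuffle.partitions, it cannot be changed at runtime.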
spark.default.parallelism defaults to the total number of cores on all executor machines. The parallelize API has no parent RDD to determine the number of partitions, so it uses spark.default.parallelism.
When running spark-submit, you're probably running it locally. Try submitting your spark-submit with the same startup configs as you use for the spark-shell.
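A minimal sketch of what that looks like, assuming a hypothetical master URL (spark://master-host:7077) and application jar/class, all of which are placeholders:

```shell
# Point spark-submit at the standalone master explicitly; without --master
# the app may run in local mode, where parallelism defaults differ.
spark-submit \
  --master spark://master-host:7077 \
  --class com.example.MyApp \
  my-app.jar

# spark-shell connected to the same cluster, for comparison:
spark-shell --master spark://master-host:7077
```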
Pulled this from the documentation:

spark.default.parallelism

For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:

- Local mode: number of cores on the local machine
- Mesos fine grained mode: 8
- Others: total number of cores on all executor nodes or 2, whichever is larger

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
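Since parallelize falls back to spark.default.parallelism, another way to make the question's snippet behave the same everywhere is to pass the partition count explicitly via the optional numSlices argument (a sketch assuming an existing SparkContext sc, as in spark-shell):

```scala
// Assumes `sc` is an existing SparkContext (as provided in spark-shell).
val myRDD = sc.parallelize(1 to 100000, 8) // explicit numSlices: 8 partitions regardless of defaults
println("Partition size - " + myRDD.partitions.size) // always 8
println("Sum - " + myRDD.reduce(_ + _))
```

This sidesteps the local-vs-cluster difference entirely, at the cost of hard-coding a partition count that may not suit every deployment.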