 

parallelize() method in SparkContext

Tags:

apache-spark

I am trying to understand the effect of passing different numSlices values to the parallelize() method in SparkContext. Given below is the signature of the method:

def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)
(implicit arg0: ClassTag[T]): RDD[T]

I ran spark-shell in local mode:

spark-shell --master local
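
Note that when numSlices is omitted it falls back to sc.defaultParallelism; in plain local mode (a single worker thread) that is typically 1, though the exact value depends on your setup. You can check it in the shell (output shown is from a separate, fresh session, so the res name may differ in yours):

scala> sc.defaultParallelism
res0: Int = 1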

My understanding is that numSlices decides the number of partitions of the resultant RDD (after calling sc.parallelize()). Consider the examples below.

Case 1

scala> sc.parallelize(1 to 9, 1);
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22
scala> res0.partitions.size
res2: Int = 1

Case 2

scala> sc.parallelize(1 to 9, 2);
res3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22
scala> res3.partitions.size
res4: Int = 2

Case 3

scala> sc.parallelize(1 to 9, 3);
res5: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22
scala> res3.partitions.size
res6: Int = 2

Case 4

scala> sc.parallelize(1 to 9, 4);
res7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:22
scala> res3.partitions.size
res8: Int = 2

Question 1: In Case 3 and Case 4, I was expecting the partition sizes to be 3 and 4 respectively, but both cases show a partition size of only 2. What is the reason for this?

Question 2: In each case there is a number associated with ParallelCollectionRDD[no], i.e. in Case 1 it is ParallelCollectionRDD[0], in Case 2 it is ParallelCollectionRDD[1], and so on. What exactly do those numbers signify?

Asked Nov 18 '15 by Raj


People also ask

How does parallelize work in Spark?

When a task is parallelized in Spark, it means that concurrent tasks may be running on the driver node or worker nodes. How the task is split across these different nodes in the cluster depends on the types of data structures and libraries that you're using.

Which of the following commands is used to parallelize a collection in Spark?

To parallelize collections in a driver program, Spark provides the SparkContext.parallelize() method.

What is parallelized collection Spark?

Parallelized collections are created by calling SparkContext's parallelize method on an existing iterable or collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
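
As a concrete sketch (assuming a running spark-shell, so an initialized SparkContext named sc is available):

val data = Seq(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)           // copy the local collection into a distributed RDD
val doubled = rdd.map(_ * 2).collect()   // transform elements in parallel, gather results back to the driver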


1 Answer

Question 1: That's a typo on your part. In Cases 3 and 4 you're calling res3.partitions.size instead of res5.partitions.size and res7.partitions.size respectively. When I run it with the correct names, it works as expected.
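
For example, on a fresh shell (the res names will depend on what you have already run):

scala> sc.parallelize(1 to 9, 3).partitions.size
res0: Int = 3

scala> sc.parallelize(1 to 9, 4).partitions.size
res1: Int = 4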

Question 2: That's the id of the RDD in the SparkContext, used for keeping the lineage graph straight. See what happens when I run the same command three times:

scala> sc.parallelize(1 to 9,1)
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22

scala> sc.parallelize(1 to 9,1)
res1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22

scala> sc.parallelize(1 to 9,1)
res2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22

There are now three different RDDs with three different ids. We can run the following to check:

scala> (res0.id, res1.id, res2.id)
res3: (Int, Int, Int) = (0,1,2)
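
These ids also show up if you print an RDD's lineage with toDebugString (output is illustrative; the console line number will vary):

scala> res2.toDebugString
res4: String = (1) ParallelCollectionRDD[2] at parallelize at <console>:22 []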
Answered Oct 23 '22 by Matthew Graves