 

How does Spark decide how to partition an RDD?

Suppose I create such an RDD (I am using Pyspark):

list_rdd = sc.parallelize(xrange(0, 20, 2), 6)

then I print the partitioned elements with the glom() method and obtain

[[0], [2, 4], [6, 8], [10], [12, 14], [16, 18]]

How has Spark decided how to partition my list? Where does that specific choice of the elements come from? It could have coupled them differently, leaving some other elements than 0 and 10 alone, to create the 6 requested partitions. At a second run, the partitions are the same.

Using a larger range with 15 elements, I get partitions in a pattern of two elements followed by three elements:

list_rdd = sc.parallelize(xrange(0, 30, 2), 6)
[[0, 2], [4, 6, 8], [10, 12], [14, 16, 18], [20, 22], [24, 26, 28]]

Using a smaller range with only 5 elements, I get

list_rdd = sc.parallelize(xrange(0, 10, 2), 6)
[[], [0], [2], [4], [6], [8]]

So what I infer is that Spark generates the partitions by splitting the list into a repeating pattern in which a smaller group is followed by one or more larger ones.

The question is whether there is a reason behind this choice. It is very elegant, but does it also provide performance advantages?

Asked Mar 04 '16 by mar tin

People also ask

How is RDD partitioned?

Apache Spark's Resilient Distributed Datasets (RDD) are a collection of various data that are so big in size, that they cannot fit into a single node and should be partitioned across various nodes. Apache Spark automatically partitions RDDs and distributes the partitions across different nodes.
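
As a quick illustration (a minimal PySpark sketch, assuming an existing SparkContext sc as in the question), you can check how many partitions Spark created and inspect their contents with glom():

# Without an explicit numSlices, Spark picks the partition count itself
# (sc.defaultParallelism, typically one per core on a local[*] master).
rdd = sc.parallelize(range(100))

print(rdd.getNumPartitions())   # how many partitions the RDD was cut into
print(rdd.glom().collect())     # which elements landed in which partition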

How many partitions should a Spark RDD have?

One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster.
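
For example (a sketch assuming an existing SparkContext sc; the factor of 3 is only an illustration of the 2-4 partitions per CPU rule of thumb):

cores = sc.defaultParallelism                            # roughly the number of available cores
rdd = sc.parallelize(range(1000), numSlices=cores * 3)   # aim for 2-4 partitions per CPU
print(rdd.getNumPartitions())                            # -> cores * 3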

How do I choose a partition key for Spark?

You should partition by a field that you need to filter by frequently and that has low cardinality, i.e. it will create a relatively small number of directories, each holding a relatively large amount of data.
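
For instance (a sketch for the DataFrame API, assuming a hypothetical DataFrame df with a low-cardinality country column, an existing SparkSession spark, and an illustrative output path):

# Write one directory per country; country is low-cardinality and frequently filtered on.
df.write.partitionBy("country").parquet("/tmp/events_by_country")

# Queries that filter on the partition column can skip whole directories on read.
spark.read.parquet("/tmp/events_by_country").where("country = 'DE'").show()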

How does hash partitioning work in Spark?

Spark Default Partitioner: the Hash Partitioner works with the hashCode() function. The idea behind hashCode() is that equal objects have the same hash code. On the basis of this concept, the Hash Partitioner groups keys with the same hash code into the same partition and distributes the keys across the partitions.
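
A small PySpark sketch of this (assuming an existing SparkContext sc; for pair RDDs, partitionBy hashes the key and takes it modulo the partition count, so equal keys always land in the same partition):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
partitioned = pairs.partitionBy(2)        # key -> hash(key) % 2 decides the partition
print(partitioned.glom().collect())       # all ("a", ...) records share one partition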


1 Answer

Unless you specify a specific partitioner, this is "random" in the sense that it depends on the specific implementation of that RDD. In this case you can head to ParallelCollectionRDD to dig into it further.

getPartitions is defined as:

val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray

where slice is commented as (reformatted to fit better):

/**
* Slice a collection into numSlices sub-collections. 
* One extra thing we do here is to treat Range collections specially, 
* encoding the slices as other Ranges to minimize memory cost. 
* This makes it efficient to run Spark over RDDs representing large sets of numbers. 
* And if the collection is an inclusive Range, 
* we use inclusive range for the last slice.
*/

Note that there are some considerations with regard to memory. So, again, this is going to be specific to the implementation.
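
For intuition, here is a small Python sketch (a simplification I am assuming from the Scala source, ignoring the special Range encoding) of the positional slicing that slice performs: slice i covers indices from i * length // numSlices up to (i + 1) * length // numSlices.

def slice_positions(length, num_slices):
    # Simplified boundaries used by ParallelCollectionRDD.slice:
    # slice i spans [i * length // num_slices, (i + 1) * length // num_slices)
    for i in range(num_slices):
        yield (i * length) // num_slices, ((i + 1) * length) // num_slices

data = list(range(0, 20, 2))  # the 10 elements from the question
print([data[start:end] for start, end in slice_positions(len(data), 6)])
# [[0], [2, 4], [6, 8], [10], [12, 14], [16, 18]] -- matches the observed glom() output

With floor division the slice sizes can only differ by one element, which explains the 1-2 and 2-3 patterns in the question: the partitions are kept as even as possible, and for Range inputs each slice is itself stored as a Range rather than a materialized list, which is where the memory saving mentioned in the comment comes from.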

Answered by Justin Pihony