What is the difference between an RDD partition and a slice?

Tags: apache-spark, hadoop

The Spark Programming Guide mentions slices as a feature of RDDs (both parallel collections and Hadoop datasets): "Spark will run one task for each slice of the cluster." But under the section on RDD persistence, the concept of partitions is used without introduction. Also, the RDD docs mention only partitions, with no mention of slices, while the SparkContext docs mention slices for creating RDDs but partitions for running jobs on RDDs. Are these two concepts the same? If not, how do they differ?

Tuning - Level of Parallelism indicates that "Spark automatically sets the number of “map” tasks to run on each file according to its size ... and for distributed “reduce” operations, such as groupByKey and reduceByKey, it uses the largest parent RDD’s number of partitions. You can pass the level of parallelism as a second argument...." Does this explain the difference between partitions and slices? Is it that partitions relate to RDD storage while slices relate to the degree of parallelism, and that by default slices are calculated from either the data size or the number of partitions?

asked May 02 '14 20:05

Carl G

1 Answer

They are the same thing. The documentation has been fixed for Spark 1.2 thanks to Matthew Farrellee. More details in the bug: https://issues.apache.org/jira/browse/SPARK-1701
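
To make "the same thing" concrete, here is a toy model in plain Python. It is not Spark code and not Spark's implementation: `split_into_slices` and `run_job` are hypothetical helpers invented for this illustration. The point is only that the pieces you ask for when creating the dataset (the "slices") are the very same units that a job later runs one task over (the "partitions").

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def split_into_slices(data: Sequence[T], num_slices: int) -> List[List[T]]:
    """Hypothetical helper: cut a local collection into num_slices contiguous pieces."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    n = len(data)
    # Piece i covers indices [i*n // num_slices, (i+1)*n // num_slices).
    return [
        list(data[i * n // num_slices:(i + 1) * n // num_slices])
        for i in range(num_slices)
    ]


def run_job(partitions: List[List[T]], task: Callable[[List[T]], R]) -> List[R]:
    """Hypothetical helper: run exactly one task per partition."""
    return [task(partition) for partition in partitions]


# "Creating the RDD" with 4 slices...
partitions = split_into_slices(range(10), 4)

# ...and "running a job" over those same 4 pieces, now called partitions.
partial_sums = run_job(partitions, sum)

print(len(partitions))    # number of slices == number of partitions == number of tasks
print(partial_sums)       # one result per task
print(sum(partial_sums))  # combined result
```

In real Spark the count requested when creating the RDD plays the role of `num_slices` here, and the number of tasks per stage follows from the number of partitions; the exact way Spark divides the data may differ from this sketch.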

answered Oct 07 '22 01:10

Daniel Darabos
