 

What is the difference between an RDD partition and a slice?

The Spark Programming Guide mentions slices as a feature of RDDs (both parallel collections and Hadoop datasets): "Spark will run one task for each slice of the cluster." But the section on RDD persistence uses the concept of partitions without introduction. Also, the RDD docs mention only partitions, with no mention of slices, while the SparkContext docs mention slices for creating RDDs but partitions for running jobs on RDDs. Are these two concepts the same? If not, how do they differ?

Tuning - Level of Parallelism indicates that "Spark automatically sets the number of “map” tasks to run on each file according to its size ... and for distributed “reduce” operations, such as groupByKey and reduceByKey, it uses the largest parent RDD’s number of partitions. You can pass the level of parallelism as a second argument...." So does this explain the difference between partitions and slices? Partitions are related to RDD storage and slices are related to the degree of parallelism, and by default slices are calculated based on either data size or the number of partitions?
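For reference, here is a minimal sketch of the quoted behavior (assuming an existing SparkContext `sc`; the data is illustrative), showing the optional parallelism argument the guide mentions:

```scala
// Assumes an existing SparkContext `sc`.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// By default reduceByKey reuses the largest parent RDD's partition count;
// the optional second argument overrides the level of parallelism.
val reduced = pairs.reduceByKey(_ + _, 10)
println(reduced.partitions.length) // 10
```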

asked May 02 '14 by Carl G

People also ask

What is an RDD partition?

Apache Spark's Resilient Distributed Datasets (RDDs) are collections of data that are too big to fit on a single node and must be partitioned across several nodes. Apache Spark automatically partitions RDDs and distributes the partitions across different nodes.
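A quick way to observe the automatic partitioning, as a sketch (assuming a SparkContext `sc`):

```scala
// With no explicit partition count, Spark picks a default
// (spark.default.parallelism, e.g. the number of local cores).
val rdd = sc.parallelize(1 to 1000)
println(rdd.partitions.length)
```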

What are the two types of RDD operations?

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
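A minimal sketch of the distinction (assuming a SparkContext `sc`):

```scala
val nums = sc.parallelize(1 to 10)

// Transformation: lazily returns a new RDD; nothing is computed yet.
val squares = nums.map(x => x * x)

// Action: triggers the computation and returns a value to the driver.
val total = squares.reduce(_ + _)
println(total) // 385
```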

What is the difference between RDD and pair RDD?

Unpaired RDDs can consist of objects of any type. Paired RDDs (key-value), however, support a few special operations, such as distributed "shuffle" operations that group or aggregate the elements by key.
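For illustration, a sketch of a pair RDD and one of its key-based operations (assuming a SparkContext `sc`):

```scala
val words = sc.parallelize(Seq("a", "b", "a"))

// Mapping to (key, value) tuples yields a pair RDD, which unlocks
// key-based shuffle operations such as reduceByKey.
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
println(counts.collect().mkString(", ")) // e.g. (a,2), (b,1) (order not guaranteed)
```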

How many partitions should a Spark RDD have?

As already mentioned above, one partition is created for each block of the file in HDFS (64 MB by default). However, a second argument can be passed when creating an RDD to define the number of partitions to create. For example, the following line creates an RDD named textFile with 5 partitions.
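A sketch of that line (the HDFS path is hypothetical; note that textFile's second argument is a minimum partition count):

```scala
// Hypothetical path; the second argument requests (at least) 5 partitions.
val textFile = sc.textFile("hdfs://namenode/path/to/file.txt", 5)
println(textFile.partitions.length) // >= 5
```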


1 Answer

They are the same thing. The documentation has been fixed for Spark 1.2 thanks to Matthew Farrellee. More details in the bug: https://issues.apache.org/jira/browse/SPARK-1701
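This is easy to verify in practice, since the numSlices argument to parallelize is observable only through the partition API. A minimal sketch (assuming a SparkContext `sc`):

```scala
// "Slices" and "partitions" are the same concept under two names:
// the numSlices argument determines rdd.partitions.length.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
assert(rdd.partitions.length == 4)
```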

answered Oct 07 '22 by Daniel Darabos