 

How does partitioning work in Spark?

I'm trying to understand how partitioning is done in Apache Spark. Can you guys help please?

Here is the scenario:

  • a master and two nodes with 1 core each
  • a file count.txt of 10 MB in size

How many partitions does the following create?

rdd = sc.textFile("count.txt")

Does the size of the file have any impact on the number of partitions?

asked Oct 14 '14 by abhishek kurasala

People also ask

How does hash partitioning work in Spark?

Spark's default partitioner is the Hash Partitioner, which relies on the hashCode() function: equal objects have the same hash code. On that basis, the Hash Partitioner assigns keys with the same hash code to the same partition and spreads the keys across the partitions.
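A minimal PySpark sketch of that idea (the data and the partition count of 2 are just illustrative): partitionBy hashes each key, so records with equal keys always land in the same partition.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # partitionBy works on (key, value) pair RDDs
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)])

    # Hash-partition into 2 partitions; all ("a", ...) records share a hash
    # code and therefore end up in the same partition, and so on.
    hashed = pairs.partitionBy(2)

    # glom() collects each partition into a list so the layout can be inspected
    print(hashed.glom().collect())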

How are Spark partitions created?

When Spark reads a file from HDFS, it creates a single partition for each input split. The input split is determined by the Hadoop InputFormat used to read the file. For example, if you have a 30 GB uncompressed text file stored on HDFS, then with the default HDFS block size setting (128 MB) the file occupies roughly 240 blocks, so the RDD read from it would have roughly 240 partitions.
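You can check this yourself with getNumPartitions(); the path below is only a placeholder for a file that exists in your environment.

    # Assumes an existing SparkContext `sc`; one partition per input split
    # (per HDFS block for an uncompressed text file).
    rdd = sc.textFile("hdfs:///data/big_file.txt")
    print(rdd.getNumPartitions())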

How does Spark determine number of partitions?

(a) If the parent RDD has a partitioner on the aggregation key(s), then the number of partitions in the aggregated RDD is equal to the number of partitions in the parent RDD. (b) If the parent RDD does not have a partitioner, then the number of partitions in the aggregated RDD is taken from spark.default.parallelism.
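A quick sketch of case (b), assuming an existing SparkContext `sc`: an RDD built with parallelize() has no partitioner, so reduceByKey falls back to the default parallelism unless you pass an explicit partition count.

    words = sc.parallelize([("spark", 1), ("hdfs", 1), ("spark", 1)])
    print(words.partitioner)            # None - no partitioner on the parent RDD

    counts = words.reduceByKey(lambda a, b: a + b)
    print(counts.getNumPartitions())    # falls back to the default parallelism

    # An explicit number of partitions overrides that default
    counts4 = words.reduceByKey(lambda a, b: a + b, numPartitions=4)
    print(counts4.getNumPartitions())   # 4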

What does it mean to partition a table in Spark?

A partition is composed of a subset of rows in a table that share the same value for a predefined subset of columns called the partitioning columns. Using partitions can speed up queries against the table as well as data manipulation.
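For DataFrames/tables that looks like this in PySpark (the column names and output path are made up for illustration): each distinct value of the partitioning column gets its own directory, and queries that filter on that column can skip the other directories.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("2022-10-01", "US", 3), ("2022-10-01", "DE", 5), ("2022-10-02", "US", 7)],
        ["date", "country", "clicks"],
    )

    # Writes one directory per distinct `date` value under the output path
    df.write.partitionBy("date").mode("overwrite").parquet("/tmp/clicks_partitioned")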


1 Answer

By default a partition is created for each HDFS block, which by default is 64 MB (see the Spark Programming Guide).

It's also possible to pass a second argument, minPartitions, which overrides the minimum number of partitions that Spark will create. If you don't pass this value, Spark uses sc.defaultMinPartitions, which is derived from spark.default.parallelism (but capped at 2).

Since spark.default.parallelism is supposed to be the number of cores across all of the machines in your cluster, and your two worker nodes have one core each, I believe there would be 2 partitions created in your case, even though the 10 MB file fits in a single block.
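A small sketch of how that plays out, assuming count.txt sits in the working directory; the printed counts depend on your cluster, so treat them as illustrative.

    # Default: max(number of input splits, sc.defaultMinPartitions)
    rdd = sc.textFile("count.txt")
    print(rdd.getNumPartitions())

    # Ask for at least 8 partitions of the same 10 MB file
    rdd8 = sc.textFile("count.txt", minPartitions=8)
    print(rdd8.getNumPartitions())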

You can also repartition or coalesce an RDD to change the number of partitions, which in turn influences the total amount of available parallelism.
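For example (the partition counts are arbitrary):

    rdd = sc.textFile("count.txt")

    # repartition() triggers a full shuffle and can increase or decrease the count
    more = rdd.repartition(10)
    print(more.getNumPartitions())   # 10

    # coalesce() avoids a shuffle and is intended for reducing the count
    fewer = more.coalesce(2)
    print(fewer.getNumPartitions())  # 2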

answered Oct 14 '22 by mrmcgreg