I'm trying to understand how partitioning is done in Apache Spark. Can you guys help please?
Here is the scenario:
Suppose I have a file count.txt of 10 MB in size. How many partitions does the following create?

rdd = sc.textFile("count.txt")
Does the size of the file have any impact on the number of partitions?
Spark Default Partitioner

The default Hash Partitioner is built on the hashcode() function: equal objects have the same hashcode. Using this property, the Hash Partitioner computes each key's hashcode modulo the number of partitions, so all keys with the same hashcode land in the same partition, and the keys as a whole are spread across the partitions.
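As a rough illustration, that idea can be sketched in plain Python. This mirrors the concept only, not Spark's actual implementation (PySpark uses its own portable_hash internally):

```python
def hash_partition(key, num_partitions):
    # Keys with equal hashcodes always map to the same partition.
    return hash(key) % num_partitions

# Integer keys hash to themselves in CPython, so this toy run is deterministic.
pairs = [(0, "a"), (4, "b"), (5, "c"), (8, "d")]
by_partition = {}
for key, value in pairs:
    by_partition.setdefault(hash_partition(key, 4), []).append((key, value))

# Keys 0, 4 and 8 all land in partition 0; key 5 lands in partition 1.
print(by_partition)  # {0: [(0, 'a'), (4, 'b'), (8, 'd')], 1: [(5, 'c')]}
```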
When Spark reads a file from HDFS, it creates a single partition for a single input split. The input split is set by the Hadoop InputFormat used to read the file. If you have a 30GB uncompressed text file stored on HDFS, then with the default HDFS block size setting (128MB) the file is stored in roughly 240 blocks, so the RDD read from it will have roughly 240 partitions.
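The block-to-partition arithmetic behind that example:

```python
file_size_mb = 30 * 1024   # 30 GB uncompressed text file
block_size_mb = 128        # default HDFS block size
# One input split per HDFS block -> one Spark partition per split.
num_partitions = -(-file_size_mb // block_size_mb)  # ceiling division
print(num_partitions)  # 240
```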
(a) If the parent RDD has a partitioner on the aggregation key(s), then the aggregated RDD has the same number of partitions as the parent RDD. (b) If the parent RDD does not have a partitioner, then the number of partitions in the aggregated RDD is the value of spark.default.parallelism.
Note that table partitioning is a separate concept: a table partition is a subset of rows that share the same value for a predefined subset of columns, called the partitioning columns. Partitioning a table can speed up queries against it as well as data manipulation.
By default a partition is created for each HDFS block, which is 64MB by default in older Hadoop versions (128MB in Hadoop 2 and later), as described in the Spark Programming Guide.

It's possible to pass a second argument, minPartitions, which overrides the minimum number of partitions Spark will create. If you don't pass it, Spark falls back to sc.defaultMinPartitions, which in recent Spark versions is capped at min(sc.defaultParallelism, 2). Since spark.default.parallelism is typically the number of cores across all the machines in your cluster, the 10MB file above would be read into at least two partitions.
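How the block size and the requested minimum interact can be sketched with the classic Hadoop FileInputFormat split-sizing formula. This is a simplification for illustration; the exact behavior depends on the InputFormat in use:

```python
import math

def estimate_splits(total_size, block_size, min_partitions, min_split_size=1):
    """Rough sketch of Hadoop FileInputFormat split sizing (simplified)."""
    goal_size = total_size // max(min_partitions, 1)
    split_size = max(min_split_size, min(goal_size, block_size))
    return math.ceil(total_size / split_size)

MB = 1024 * 1024
# A 10 MB file sitting in a single 128 MB block:
print(estimate_splits(10 * MB, 128 * MB, min_partitions=2))  # 2
print(estimate_splits(10 * MB, 128 * MB, min_partitions=8))  # 8
```

Asking for more partitions shrinks the target split size below the block size, which is why minPartitions can raise the partition count even for a file that fits in one block.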
You can also repartition or coalesce an RDD to change its number of partitions, which in turn influences the total amount of available parallelism.
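The difference between the two can be sketched in plain Python (an illustration of the semantics, not Spark code): coalesce merges existing partitions without a shuffle, while repartition redistributes every record by hash.

```python
def coalesce(partitions, n):
    # Merge existing partitions into n groups; records stay with their partition.
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

def repartition(partitions, n):
    # Full shuffle: every individual record is reassigned by hash.
    shuffled = [[] for _ in range(n)]
    for part in partitions:
        for record in part:
            shuffled[hash(record) % n].append(record)
    return shuffled

parts = [[1, 2], [3], [4, 5], [6]]
print(coalesce(parts, 2))     # [[1, 2, 4, 5], [3, 6]]
print(repartition(parts, 3))  # [[3, 6], [1, 4], [2, 5]]
```

Because coalesce avoids the shuffle it is cheaper when reducing the partition count, but it cannot increase it; repartition can go either way at the cost of moving all the data.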