
When should I repartition an RDD?

I know that I can repartition an RDD to increase its partitions and use coalesce to decrease its partitions. I have two questions about this that I still cannot completely answer after reading different resources.

Spark uses a sensible default (1 partition per block, which was 64 MB in the first versions and is now 128 MB) when generating an RDD, but I have also read that it is recommended to use 2 or 3 times the number of cores running the jobs. So here are my questions:

  1. How many partitions should I use for a given file? For example, suppose I have a 10 GB .parquet file and 3 executors with 2 cores and 3 GB of memory each. Should I repartition? How many partitions should I use? What is the best way to make that choice? (A sizing sketch follows this list.)

  2. Are all data types (i.e. .txt, .parquet, etc.) repartitioned by default if no partitioning is provided?
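Here is just a sketch of how I currently think about question 1, based on the two rules of thumb above; the file path and the 128 MB per-partition target are my assumptions, not something I know to be right:

val fileSizeMB = 10 * 1024                                // the 10 GB .parquet file
val totalCores = 3 * 2                                    // 3 executors * 2 cores each
val bySize  = math.ceil(fileSizeMB / 128.0).toInt         // ~80 partitions of ~128 MB each
val byCores = totalCores * 3                              // 18 partitions (3x the cores)
val target  = math.max(bySize, byCores)                   // which of these should win?

val df = spark.read.parquet("/path/to/10gb-file.parquet") // hypothetical path
val repartitioned = df.repartition(target)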

asked Aug 18 '17 by Marcos


People also ask

When should I use repartition?

Repartition can be used to either increase or decrease the number of partitions of a DataFrame. Repartition is a full shuffle operation: all the data is taken out of the existing partitions and equally distributed into the newly formed partitions.

Why should you repartition in Spark?

The repartition() method is used to increase or decrease the number of partitions of an RDD or DataFrame in Spark. This method performs a full shuffle of the data across all the nodes. It creates partitions of roughly equal size. This is a costly operation, given that it involves moving data all over the network.

Why do we need to repartition?

Repartition is a transformation in Spark that changes the number of partitions and balances the data. It can be used to increase or decrease the number of partitions, and it always shuffles all the data over the network, so it is a fairly expensive operation.

Should I use coalesce or repartition?

Spark repartition() vs coalesce() – repartition() is used to increase or decrease the number of partitions of an RDD, DataFrame, or Dataset, whereas coalesce() is used only to decrease the number of partitions, and does so more efficiently because it avoids a full shuffle.
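A minimal sketch of the difference (the partition counts here are illustrative, not taken from the answers above):

val df = spark.range(0, 1000000)          // partition count comes from the session defaults

val wider = df.repartition(200)           // full shuffle; can increase or decrease partitions
val fewer = df.coalesce(10)               // no full shuffle; can only decrease partitions

println(wider.rdd.getNumPartitions)       // 200
println(fewer.rdd.getNumPartitions)       // 10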


1 Answer

Spark can run a single concurrent task for every partition of an RDD, up to the total number of cores in the cluster.

For example:

val rdd = sc.textFile("file.txt", 5)

The above line of code creates an RDD named rdd, read via textFile, with 5 partitions.

Suppose you have a cluster with 4 cores and that each partition takes 5 minutes to process. For the above RDD with 5 partitions, 4 partition tasks will run in parallel (one per core), and the 5th partition will start processing after 5 minutes, when one of the 4 cores becomes free.

The entire processing will complete in 10 minutes, and while the 5th partition is being processed, the remaining 3 cores sit idle.

The best way to decide on the number of partitions of an RDD is to make the number of partitions equal to the number of cores in the cluster, so that all the partitions are processed in parallel and the resources are utilized optimally.
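As a rough sketch of that advice (reusing "file.txt" from the example above and assuming the 4-core cluster):

val cores = sc.defaultParallelism          // usually the total number of cores available
val rdd   = sc.textFile("file.txt")        // partition count chosen by Spark (block based)

// repartition only if the current count does not already match the core count
val balanced = if (rdd.getNumPartitions != cores) rdd.repartition(cores) else rdd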


Question: Are all data types (i.e. .txt, .parquet, etc.) repartitioned by default if no partitioning is provided?

Every RDD gets a default number of partitions. To check it, you can call rdd.partitions.length right after the RDD is created.
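For example (the parquet file name below is just a placeholder):

val textRdd = sc.textFile("file.txt")
println(textRdd.partitions.length)                   // default, typically one partition per block

val parquetDf = spark.read.parquet("file.parquet")   // placeholder file name
println(parquetDf.rdd.getNumPartitions)              // default chosen by the Parquet reader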

To use the existing cluster resources optimally and to speed things up, we have to consider repartitioning so that all cores are utilized and all partitions have a sufficient number of records, uniformly distributed.

For better understanding, also have a look at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html

Note: There is no fixed formula for this. The general convention most people follow is

(number of executors * cores per executor) * factor (which may be 2 or 3, i.e. 2 or 3 times the total core count)
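Applied to the cluster in the question (3 executors with 2 cores each, taking the factor as 3), a sketch would look like this; the parquet file name is just a placeholder:

val numExecutors     = 3
val coresPerExecutor = 2
val factor           = 3                                          // "2 or 3 times" the core count
val targetPartitions = numExecutors * coresPerExecutor * factor   // 18

val tuned = spark.read.parquet("file.parquet").repartition(targetPartitions)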

answered Sep 17 '22 by Ram Ghadiyaram