To test how .repartition()
works, I ran the following code:
rdd = sc.parallelize(range(100))
rdd.getNumPartitions()
rdd.getNumPartitions() returned 4. Then I ran:
rdd = rdd.repartition(10)
rdd.getNumPartitions()
rdd.getNumPartitions() this time returned 10, so there were now 10 partitions.
However, when I inspected the partitions with:
rdd.glom().collect()
the result showed 4 non-empty lists and 6 empty lists. Why weren't any elements distributed to the other 6 partitions?
You can force a new partitioning by using the partitionBy command and providing a number of partitions. By default the partitioner is hash-based, but you can switch to a range-based partitioner for a better distribution.
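Here is a minimal sketch of that approach, assuming the sc from the question. Note that partitionBy is only defined on key-value RDDs, so this sketch pairs each element with itself first:
rdd = sc.parallelize(range(100)).map(lambda x: (x, x))  # partitionBy needs (key, value) pairs
partitioned = rdd.partitionBy(10)  # uses a hash partitioner by default; a custom partitionFunc can be passed
print(partitioned.getNumPartitions())                  # 10
print([len(p) for p in partitioned.glom().collect()])  # element count per partition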
In order to achieve high parallelism, Spark splits the data into smaller chunks called partitions, which are distributed across the different nodes of the Spark cluster. Every node can have more than one executor, each of which can execute tasks.
The general recommendation for Spark is to have about 4x as many partitions as there are cores available to the application; as an upper bound on the partition count, each task should still take at least 100 ms to execute, otherwise scheduling overhead dominates.
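As a rough illustration of that rule of thumb (sc.defaultParallelism is only an approximation of the core count available to the application):
cores = sc.defaultParallelism   # parallelism hint Spark derives from the available cores
target = 4 * cores              # ~4x partitions per core, per the guideline above
rdd = sc.parallelize(range(100000), target)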
Spark's RDD coalesce() is used only to reduce the number of partitions. It is an optimized version of repartition() for that case: coalesce merges existing partitions rather than performing a full shuffle, so less data moves across the cluster.
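A minimal sketch, again assuming the question's sc:
rdd = sc.parallelize(range(100), 10)
print(rdd.getNumPartitions())      # 10
smaller = rdd.coalesce(4)          # merges partitions; avoids a full shuffle by default
print(smaller.getNumPartitions())  # 4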
The algorithm behind repartition() uses logic to find an effective way to redistribute data across partitions. In this case, your range is very small, and it does not find it worthwhile to break the data down further. If you use a much bigger range, like 100000, you will find that it does in fact redistribute the data.
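A quick way to check this for yourself (exact counts may vary by Spark version and configuration):
big = sc.parallelize(range(100000)).repartition(10)
print([len(p) for p in big.glom().collect()])  # expect all 10 partitions to be non-empty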
If you want to force a certain number of partitions, you can specify it upon the initial load of the data. At that point, Spark will try to distribute the data evenly across partitions even if it is not necessarily optimal. The parallelize function takes the number of partitions as a second argument:
rdd = sc.parallelize(range(100), 10)
The same thing works if you, say, read from a text file:
rdd = sc.textFile('path/to/file', minPartitions)
(note that for textFile the second argument is minPartitions, a minimum partition count rather than an exact one)