
How to repartition a PySpark DataFrame?

data.rdd.getNumPartitions() # output 2456

Then I do

data.rdd.repartition(3000)

but

data.rdd.getNumPartitions() # output is still 2456

How can I change the number of partitions? One approach would be to convert the DataFrame to an RDD, repartition it, and then convert the RDD back to a DataFrame, but this takes a lot of time. Also, does increasing the number of partitions make operations more distributed and therefore faster? Thanks.

Neo asked Aug 23 '17 16:08




2 Answers

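You can check the number of partitions:

data.rdd.getNumPartitions()

To change the number of partitions, assign the result of repartition() to a variable, because the call returns a new DataFrame rather than modifying data in place:

newDF = data.repartition(3000)

Then check the partition count of the new DataFrame:

newDF.rdd.getNumPartitions()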

Beware that repartitioning triggers a full data shuffle, which is expensive. If you only need to reduce the number of partitions, take a look at coalesce.
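As a rough sketch of how the choice between the two calls might look, the helper below uses coalesce() when shrinking and repartition() when growing. The name resize_partitions is made up for illustration and is not part of PySpark; it only assumes that df behaves like a PySpark DataFrame, that is, it exposes rdd.getNumPartitions(), coalesce() and repartition().

```python
def resize_partitions(df, target):
    """Return a DataFrame with `target` partitions.

    Hypothetical helper (not a PySpark API). `df` is assumed to be a
    PySpark DataFrame or any object with the same three methods.
    """
    current = df.rdd.getNumPartitions()
    if target == current:
        # Nothing to do; hand back the same object.
        return df
    if target < current:
        # coalesce() merges existing partitions and avoids a full shuffle.
        return df.coalesce(target)
    # Increasing the partition count requires a full shuffle.
    return df.repartition(target)
```

Usage would look like data = resize_partitions(data, 3000). As with repartition() itself, the result has to be assigned, since the original DataFrame is left unchanged.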

Michel Lemay answered Nov 05 '22 15:11


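print(df.rdd.getNumPartitions())
# 1

# repartition() returns a new DataFrame; the result is discarded here, so df is unchanged
df.repartition(5)
print(df.rdd.getNumPartitions())
# 1

# Reassign the result to keep the repartitioned DataFrame
df = df.repartition(5)
print(df.rdd.getNumPartitions())
# 5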

See Spark: The Definitive Guide, Chapter 5, "Basic Structured Operations".
ISBN-13: 978-1491912218
ISBN-10: 1491912219

Ali Payne answered Nov 05 '22 14:11