data.rdd.getNumPartitions()  # output: 2456

Then I do:

data.rdd.repartition(3000)

But:

data.rdd.getNumPartitions()  # output is still 2456

How do I change the number of partitions? One approach is to first convert the DataFrame into an RDD, repartition it, and then convert the RDD back to a DataFrame, but this takes a lot of time. Also, does increasing the number of partitions make operations more distributed and therefore faster? Thanks.
The repartition() method redistributes data by performing a full shuffle across the cluster. It can be used to either increase or decrease the number of partitions of an RDD or DataFrame in Spark, and it produces partitions of roughly equal size. Because it moves data across the network, it is a costly operation. (Note that the spark.sql.shuffle.partitions setting, which defaults to 200, is a separate property: it controls the number of partitions produced by shuffle operations such as joins and aggregations, not by repartition() itself.)
If you want to change the number of partitions of your DataFrame, call repartition(). It returns a new DataFrame partitioned by the given partitioning expressions; the original DataFrame is left unchanged, so you must capture the result.
You can check the number of partitions:

data.rdd.getNumPartitions()

To change the number of partitions, assign the returned DataFrame:

newDF = data.repartition(3000)

Then check again:

newDF.rdd.getNumPartitions()
Beware: repartition() triggers a full data shuffle, which is expensive. Take a look at coalesce() if you only need to decrease the number of partitions; it avoids a full shuffle by merging existing partitions.
print(df.rdd.getNumPartitions())
# 1
df.repartition(5)  # returned DataFrame is discarded; df is unchanged
print(df.rdd.getNumPartitions())
# 1
df = df.repartition(5)  # assign the returned DataFrame
print(df.rdd.getNumPartitions())
# 5
See Spark: The Definitive Guide, Chapter 5, "Basic Structured Operations" (ISBN-13: 978-1491912218, ISBN-10: 1491912219).