data.rdd.getNumPartitions()  # output: 2456

Then I do:

data.rdd.repartition(3000)

But:

data.rdd.getNumPartitions()  # output is still 2456

How do I change the number of partitions? One approach is to first convert the DataFrame into an RDD, repartition it, and then convert the RDD back to a DataFrame, but this takes a lot of time. Also, does increasing the number of partitions make operations more distributed and therefore faster? Thanks.
The repartition() method redistributes data by performing a full shuffle across the cluster. It can be used to either increase or decrease the number of partitions of an RDD or DataFrame in Spark, and it produces partitions of roughly equal size. Because it moves data across the network, it is a costly operation. (Note that the spark.sql.shuffle.partitions setting, which defaults to 200, is a separate property: it controls the number of partitions produced by shuffle operations such as joins and aggregations, not by repartition() itself.)
If you want to change the number of partitions of your DataFrame, call repartition(). It returns a new DataFrame partitioned by the given partitioning expressions; the original DataFrame is left unchanged, so you must capture the result.
You can check the number of partitions:

data.rdd.getNumPartitions()

To change the number of partitions, assign the returned DataFrame:

newDF = data.repartition(3000)

Then check again:

newDF.rdd.getNumPartitions()
Beware: repartition() triggers a full data shuffle, which is expensive. Take a look at coalesce() if you only need to decrease the number of partitions; it avoids a full shuffle by merging existing partitions.
print(df.rdd.getNumPartitions())
# 1
df.repartition(5)  # returned DataFrame is discarded; df is unchanged
print(df.rdd.getNumPartitions())
# 1
df = df.repartition(5)  # assign the returned DataFrame
print(df.rdd.getNumPartitions())
# 5
See Spark: The Definitive Guide, Chapter 5, "Basic Structured Operations" (ISBN-13: 978-1491912218, ISBN-10: 1491912219).