 

What is the role of npartitions in a Dask dataframe?

I see the parameter npartitions in many functions, but I don't understand what it is used for.

http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_csv

head(...)

Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions a warning will be raised and any found rows returned. Pass -1 to use all partitions.

repartition(...)

Number of partitions of output, must be less than npartitions of input. Only used if divisions isn’t specified.

Is the number of partitions 5 in this case?

(Image source: http://dask.pydata.org/en/latest/dataframe-overview.html )

Martin Thoma asked Oct 09 '17


1 Answer

The npartitions property is the number of Pandas dataframes that compose a single Dask dataframe. This affects performance in two main ways.

  1. If you don't have enough partitions then you may not be able to use all of your cores effectively. For example, if your dask.dataframe has only one partition, then only one core can operate at a time.
  2. If you have too many partitions then the scheduler may incur a lot of overhead deciding where to compute each task.

Generally you want a few times more partitions than you have cores. Every task takes up a few hundred microseconds in the scheduler.
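As a rough illustration of that rule of thumb (the multiplier of 4 below is an arbitrary choice for the sketch, not a Dask default):

```python
import os

cores = os.cpu_count() or 1
# "a few times more partitions than cores": pick a small multiplier
suggested_npartitions = 4 * cores
print(f"{cores} cores -> aim for roughly {suggested_npartitions} partitions")
```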

You can determine the number of partitions either at data ingestion time, using parameters like blocksize= in read_csv(...), or afterwards with the .repartition(...) method.

MRocklin answered Sep 19 '22