Does npartitions influence the result of dask.dataframe.head()?

Tags: python, pandas, dask

When running the following code, the result of dask.dataframe.head() depends on npartitions:

import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({'A': [1,2,3], 'B': [2,3,4]})
ddf = dd.from_pandas(df, npartitions = 3)
print(ddf.head())

This yields the following result:

   A  B
0  1  2

However, when I set npartitions to 1 or 2, I get the expected result:

   A  B
0  1  2
1  2  3
2  3  4

It seems to be important that npartitions is lower than the length of the dataframe. Is this intended?

asked Jul 09 '16 03:07

Arco Bast

1 Answer

According to the documentation dd.head() only checks the first partition:

head(n=5, compute=True)

First n rows of the dataset

Caveat, this only checks the first n rows of the first partition.

So the answer is yes: the result of dd.head() is influenced by how many partitions your dask dataframe has.

However, the first partition is normally expected to hold more rows than you would want to display with dd.head(); otherwise, using dask is unlikely to pay off. The only common case where this might not hold is when taking the first n rows/elements after filtering, as explained in this question.

answered Oct 21 '22 07:10

dukebody
