Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

does npartitions influence the result of dask.dataframe.head()?

When running the following code, the result of dask.dataframe.head() depends on npartitions:

import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({'A': [1,2,3], 'B': [2,3,4]})
ddf = dd.from_pandas(df, npartitions = 3)
print(ddf.head())

This yields the following result:

   A  B
0  1  2

However, when I set npartitions to 1 or 2, I get the expected result:

   A  B
0  1  2
1  2  3
2  3  4

It seems to be important, that npartitions is lower than the length of the dataframe. Is this intended?

like image 801
Arco Bast Avatar asked Jul 09 '16 03:07

Arco Bast


People also ask

How to improve the performance of DASK DataFrames?

Increase the number of days or reduce the frequency to practice with a larger dataset. Unlike Pandas, Dask DataFrames are lazy and so no data is printed here. ... ... ... ... ... ... ...

What is the difference between pandas Dataframe and DASK Dataframe?

A Dataframe is simply a two-dimensional data structure used to align data in a tabular form consisting of rows and columns. A Dask DataFrame is composed of many smaller Pandas DataFrames that are split row-wise along the index. An operation on a single Dask DataFrame triggers many operations on the Pandas DataFrames that constitutes it.

How to convert DASK bag to DASK Dataframe in Python?

Use reindex afterward if necessary. “`python # Converting dask bag into dask dataframe dataframe=my_bag.to_dataframe () dataframe.compute () “` **2. How to create `Dask.Delayed` object from Dask bag** You can convert `dask.bag` into a list of `dask.delayed` objects, one per partition using the `dask.bagto_delayed ()` function.

What is the structure of the DASK series in pandas?

Dask Series Structure: npartitions=1 float64 ... Name: x, dtype: float64 Dask Name: sqrt, 157 tasks Call .compute () when you want your result as a Pandas dataframe.


1 Answers

According to the documentation dd.head() only checks the first partition:

head(n=5, compute=True)

First n rows of the dataset

Caveat, this only checks the first n rows of the first partition.

So the answer is yes, dd.head() is influenced by how many partitions are there in your dask dataframe.

However the number of rows in the first partition is expected to be larger than the number of rows you usually want to show when using dd.head() — otherwise using dask shouldn't pay off. The only common case when this might not be true is when taking the first n rows/elements after filtering, as explained in this question.

like image 97
dukebody Avatar answered Oct 21 '22 07:10

dukebody