Subset dask dataframe by column position

Q: How do I select columns in Dask DataFrame?

Just like Pandas, Dask DataFrame supports label-based indexing with the .loc accessor for selecting rows or columns, and __getitem__ (square brackets) for selecting just columns. To select rows, the DataFrame's divisions must be known (see Internal Design and Dask DataFrames Best Practices for more information).

Q: What is Npartitions in Dask DataFrame?

The npartitions property is the number of Pandas dataframes that compose a single Dask dataframe. This affects performance in two main ways. If you don't have enough partitions then you may not be able to use all of your cores effectively. For example if your dask.

Q: Is Dask faster than pandas?

The original pandas query took 182 seconds and the optimized Dask query took 19 seconds, which is about 10 times faster. Dask can provide performance boosts over pandas because it can execute common operations in parallel, where pandas is limited to a single core.

Once I have a dask dataframe, how can I selectively pull columns into an in-memory pandas DataFrame? Say I have an N x M dataframe. How can I create an N x m dataframe where m << M and is arbitrary?

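from sklearn.datasets import load_iris
import pandas as pd
import dask.dataframe as dd

# Load iris as a pandas DataFrame (150 rows, 4 columns labelled 0..3)
d = load_iris()
df = pd.DataFrame(d.data)
# Wrap it in a Dask DataFrame with about 100 rows per partition
ddf = dd.from_pandas(df, chunksize=100)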

What I would like to do:

in_memory = ddf.iloc[:,2:4].compute()

What I have been able to do:

ddf.map_partitions(lambda x: x.iloc[:,2:4]).compute()

map_partitions works but it was quite slow on a file that wasn't very large. I hope I am missing something very obvious.

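Row-wise iloc is not implemented for Dask DataFrames (recent Dask releases do accept column-only positional indexing such as ddf.iloc[:, 2:4]), but you can select columns by position easily enough as follows: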

cols = list(ddf.columns[2:4])
ddf[cols].compute()

This has the additional benefit that Dask immediately knows the types of the selected columns and needs to do no additional work. For the map_partitions variant, Dask at the least needs to check the data types produced, since the function you call is completely arbitrary.

Tags: python, pandas, dask

Asked by Zelazny7. Answered by mdurant.
