Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Subset dask dataframe by column position

Once I have a dask dataframe, how can I selectively pull columns into an in-memory pandas DataFrame? Say I have an N x M dataframe. How can I create an N x m dataframe where m << M and is arbitrary.

from sklearn.datasets import load_iris
import dask.dataframe as dd

d = load_iris()
df = pd.DataFrame(d.data)
ddf = dd.from_pandas(df, chunksize=100)

What I would like to do:

in_memory = ddf.iloc[:,2:4].compute()

What I have been able to do:

ddf.map_partitions(lambda x: x.iloc[:,2:4]).compute()

map_partitions works but it was quite slow on a file that wasn't very large. I hope I am missing something very obvious.

like image 904
Zelazny7 Avatar asked May 24 '17 19:05

Zelazny7


People also ask

How do I select columns in Dask DataFrame?

Just like Pandas, Dask DataFrame supports label-based indexing with the . loc accessor for selecting rows or columns, and __getitem__ (square brackets) for selecting just columns. To select rows, the DataFrame's divisions must be known (see Internal Design and Dask DataFrames Best Practices for more information.)

What is Npartitions in Dask DataFrame?

The npartitions property is the number of Pandas dataframes that compose a single Dask dataframe. This affects performance in two main ways. If you don't have enough partitions then you may not be able to use all of your cores effectively. For example if your dask.

Is Dask faster than pandas?

The original pandas query took 182 seconds and the optimized Dask query took 19 seconds, which is about 10 times faster. Dask can provide performance boosts over pandas because it can execute common operations in parallel, where pandas is limited to a single core.


Video Answer


1 Answers

Although iloc is not implemented for dask-dataframes, you can achieve the indexing easily enough as follows:

cols = list(ddf.columns[2:4])
ddf[cols].compute()

This has the additional benefit, that dask knows immediately the types of the columns selected, and needs to do no additional work. For the map_partitions variant, dask at the least needs to check the data types produces, since the function you call is completely arbitrary.

like image 163
mdurant Avatar answered Sep 19 '22 02:09

mdurant