Once I have a Dask DataFrame, how can I selectively pull columns into an in-memory pandas DataFrame? Say I have an N x M dataframe. How can I create an N x m dataframe, where m << M and the choice of columns is arbitrary?
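from sklearn.datasets import load_iris
import pandas as pd
import dask.dataframe as dd

d = load_iris()
df = pd.DataFrame(d.data)
ddf = dd.from_pandas(df, chunksize=100)

# What I would like to do: select columns by position
in_memory = ddf.iloc[:,2:4].compute()

# What I have been able to do: apply iloc to each partition
ddf.map_partitions(lambda x: x.iloc[:,2:4]).compute()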
map_partitions works, but it was quite slow on a file that wasn't very large. I hope I am missing something very obvious.
Just like pandas, Dask DataFrame supports label-based indexing with the .loc accessor for selecting rows or columns, and __getitem__ (square brackets) for selecting just columns. To select rows, the DataFrame's divisions must be known (see Internal Design and Dask DataFrames Best Practices for more information).
The npartitions property is the number of Pandas dataframes that compose a single Dask dataframe. This affects performance in two main ways. If you don't have enough partitions then you may not be able to use all of your cores effectively. For example if your dask.
Dask DataFrame's iloc cannot select rows by position (and older versions do not implement iloc at all), but you can get the same column selection easily enough by looking up the column labels first:
cols = list(ddf.columns[2:4])
ddf[cols].compute()
This has the additional benefit that Dask immediately knows the types of the selected columns and needs to do no additional work. For the map_partitions variant, Dask at the very least needs to check the data types produced, since the function you call is completely arbitrary.