
Subsetting Dask DataFrames

Tags: python, dask

Is this a valid way of loading subsets of a dask dataframe into memory:

i = 0
while i < len_df:
    j = min(i + batch_size, len_df)
    # pull one batch of the column into memory as a pandas Series
    # (note: .loc slicing is label-based and inclusive of j)
    subset = df.loc[i:j, 'source_country_codes'].compute()
    i = j

I read somewhere that this may not be correct because of how dask assigns index numbers when it divides the larger dataframe into smaller pandas dataframes. Also, I don't think dask dataframes have an iloc attribute. I am using version 0.15.2.
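For reference, whether dask actually knows the index boundaries can be checked like this (a minimal sketch; data.csv is a hypothetical file containing the column above):

import dask.dataframe as dd

df = dd.read_csv('data.csv')  # hypothetical file with a 'source_country_codes' column

print(df.npartitions)      # number of underlying pandas partitions
print(df.known_divisions)  # False for read_csv: each partition gets its own
                           # 0-based index, so df.loc[i:j] can match rows
                           # in several partitions at once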

In terms of use cases, this would be a way of loading batches of data for deep learning (say, Keras).

asked by sachinruk on Nov 16 '25

1 Answer

If your dataset has well-known divisions then this might work, but instead I recommend just computing one partition at a time:

for part in df.to_delayed():
    subset = part.compute()  # one pandas DataFrame per partition
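Each element of df.to_delayed() is a dask.delayed value, so nothing is read until you call compute() on it; subset is then an ordinary in-memory pandas object.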

You can roughly control the size by repartitioning beforehand:

# split into ~100 partitions before iterating
for part in df.repartition(npartitions=100).to_delayed():
    subset = part.compute()

This isn't exactly the same, because it doesn't guarantee a fixed number of rows in each partition, but that guarantee might be quite expensive, depending on how the data is obtained.
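If you do need fixed-size batches (for example to feed Keras), one option is to carry a remainder across partition boundaries yourself. A minimal sketch, assuming a single column and that each partition fits in memory; iter_batches is a hypothetical helper, and the column name comes from the question:

import numpy as np

def iter_batches(df, batch_size, column='source_country_codes'):
    """Yield fixed-size numpy batches, computing one dask partition at a time."""
    leftover = None
    for part in df.to_delayed():
        values = part.compute()[column].values
        if leftover is not None:
            values = np.concatenate([leftover, values])
        n_full = (len(values) // batch_size) * batch_size
        for start in range(0, n_full, batch_size):
            yield values[start:start + batch_size]
        leftover = values[n_full:]  # remainder rolls into the next partition
    if leftover is not None and len(leftover):
        yield leftover  # final short batch

You could then feed each yielded batch into something like Keras's train_on_batch.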

answered by MRocklin on Nov 17 '25

