
Dask DataFrame .head() very slow after indexing

Tags:

dask

I don't have a reproducible example, but can someone explain why a .head() call becomes so much slower after setting the index?

import dask.dataframe as dd
df = dd.read_parquet("Filepath")
df.head() # takes 10 seconds

df = df.set_index('id')

df.head() # takes 10 minutes +
AZhao asked Sep 17 '25 17:09


1 Answer

As stated in the docs, set_index sorts your data by the new index, so that the divisions along that index split the data into its logical partitions. The sort is what takes the extra time, but once it is done, operations that work on that index will be much faster. head() on the raw file simply fetches the first data chunk from disk, without regard for any ordering.
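To make the difference concrete, here is a toy model in plain Python. It is only a sketch of the idea, not how Dask is implemented: the partition contents and the two helper functions are made up for illustration.

```python
# Toy model: each inner list is one partition of 'id' values, in on-disk order.
# The values and both helper functions are hypothetical, for illustration only.
partitions = [[42, 7, 19], [3, 88, 15], [61, 1, 27]]


def head_unsorted(parts, n=2):
    """First n rows in storage order: only the first partition is read."""
    touched = 1
    return parts[0][:n], touched


def head_after_sort(parts, n=2):
    """First n rows by id: every partition has to be read and sorted first."""
    touched = len(parts)
    all_ids = sorted(x for part in parts for x in part)
    return all_ids[:n], touched


print(head_unsorted(partitions))    # ([42, 7], 1)
print(head_after_sort(partitions))  # ([1, 3], 3)
```

In the unsorted case only one partition is touched, while the sorted case has to look at all of them before it can return even the first rows.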

You can set the index without this sort, either with the index= keyword to read_parquet (maybe the data was inherently ordered already?) or with .map_partitions(lambda df: df.set_index(..)). That raises the obvious question, though: why bother, and what are you trying to achieve? If the data were already sorted, you could also have used set_index(.., sorted=True), and maybe even the divisions keyword if you happen to have that information. This would not need the sort and would be correspondingly faster.

mdurant answered Sep 23 '25 13:09


