I have a calculation that expects a pandas DataFrame as input. I'd like to run this calculation on data stored in a netCDF file that expands to 51 GB. Currently I've been opening the file with xarray.open_dataset and using chunks (my understanding is that the opened dataset is then backed by dask arrays, so only chunks of data are loaded into memory at a time). However, I don't seem to be able to take advantage of this lazy loading, since I have to convert the xarray data into a pandas DataFrame in order to run my calculation, and my understanding is that at that point all the data is loaded into memory (which is bad).
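For reference, here's roughly what I'm doing now (the file name, chunk sizes, and my_calculation are placeholders, but the shape of the problem is the same):

```python
import xarray as xr

# Opening with chunks= makes the variables dask arrays, so nothing
# is read from disk yet.
ds = xr.open_dataset("big_data.nc", chunks={"time": 1000})

# This is the step that defeats the laziness: to_dataframe() computes
# every variable and materializes all 51 GB in memory at once.
df = ds.to_dataframe()
my_calculation(df)
```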
So long story short, my question is: how can I get from an xarray dataset to a pandas DataFrame without any intermediate step that loads my entire dataset into memory? I've seen dask work with pandas.read_csv, and I've seen it work with xarray, but I'm not sure how to convert an already opened netCDF xarray dataset to a pandas DataFrame in chunks.
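For example, this is the kind of chunked workflow I mean on the CSV side (paths and column name made up):

```python
import dask.dataframe as dd

# dask reads the CSVs in blocks and returns a lazy dataframe, so only
# one partition needs to be in memory at a time.
ddf = dd.read_csv("data-*.csv")
result = ddf.groupby("station").mean().compute()
```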
Thanks and sorry for the vague question!
This is a good question. It should be doable, but I'm not quite sure what the right approach is.
Ideally, we could simply implement an xarray.Dataset.to_dask_dataframe() method. But there are several challenges here, the biggest one being that dask currently does not support dataframes with a MultiIndex.
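To make the idea concrete, usage of such a method might look like this (a sketch only; the variable name is hypothetical, and for what it's worth xarray did later add Dataset.to_dask_dataframe, which avoids the MultiIndex problem by returning dimension coordinates as plain columns):

```python
import xarray as xr

ds = xr.open_dataset("big_data.nc", chunks={"time": 1000})

# Each partition of the resulting dask dataframe would map to a chunk
# of the dataset, so nothing is loaded eagerly.
ddf = ds.to_dask_dataframe()

# Computations then stream through the data chunk by chunk.
result = ddf["temperature"].mean().compute()
```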
Alternatively, you might want to construct a list of dask.Delayed objects, each holding a pandas.DataFrame for one chunk of the xarray.Dataset. Toward this end, it would be nice if xarray had something like dask.array's to_delayed method for converting a Dataset into an array of delayed datasets, which you could then lazily convert into DataFrame objects and run your computation on; see the sketch below.
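Until something like that exists, you can approximate it by hand: slice the dataset along one dimension and wrap each slice's DataFrame conversion in dask.delayed. A rough sketch, where the dimension name, chunk size, and my_calculation are placeholders:

```python
import dask
import xarray as xr

ds = xr.open_dataset("big_data.nc", chunks={"time": 1000})

@dask.delayed
def chunk_to_frame(dataset, start, stop):
    # .isel is lazy; to_dataframe() only loads this slice into memory.
    return dataset.isel(time=slice(start, stop)).to_dataframe()

chunk_size = 1000
n = ds.sizes["time"]

# One delayed pandas.DataFrame per chunk along the time dimension.
frames = [
    chunk_to_frame(ds, start, min(start + chunk_size, n))
    for start in range(0, n, chunk_size)
]

# Apply the per-chunk pandas calculation lazily, then compute everything.
results = dask.compute(*[dask.delayed(my_calculation)(frame) for frame in frames])
```

Note that to_dataframe() on a slice still materializes that whole slice, so choose the chunk size (and limit scheduler parallelism) to keep each piece comfortably in memory.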
I encourage you to open an issue on either the dask or xarray GitHub pages to discuss, especially if you might be interested in contributing code. EDIT: You can find that issue here.