I want to lazily create a Dask dataframe from a generator, which looks something like:
[parser.read(local_file_name) for local_file_name in repo.download_files()]
Where both parser.read and repo.download_files return generators (using yield). parser.read yields a dictionary of key-value pairs; if I were just using plain pandas, I would collect each dictionary into a list and then use:
df = pd.DataFrame(parsed_rows)
What's the best way to create a Dask dataframe from this? The reasons are that a) I don't necessarily know the number of results returned, and b) I don't know the memory capacity of the machine it will be deployed on.
Alternatively, what should I be doing differently (e.g. maybe create a bunch of dataframes and then put those into Dask instead)?
Thanks.
If you want to use the single-machine Dask scheduler then you'll need to know how many files you have to begin with. That might look something like the following:
import dask.dataframe as dd
from dask import delayed

filenames = repo.download_files()
dataframes = [delayed(load)(filename) for filename in filenames]  # load returns one pandas DataFrame per file
df = dd.from_delayed(dataframes)
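The load function above is left undefined; here is a minimal sketch of it, assuming (per the question) that parser.read(filename) yields one dict of column-value pairs per row:

import pandas as pd

def load(filename):
    # Hypothetical helper: materialize one file's rows into a pandas DataFrame.
    return pd.DataFrame(list(parser.read(filename)))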
If you use the distributed scheduler you can add new computations on the fly, but this is a bit more advanced.
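A minimal sketch of that approach, assuming a dask.distributed Client and the load helper above; files can be submitted as the generator produces them:

from dask.distributed import Client
import dask.dataframe as dd

client = Client()  # here a local cluster; in practice, point this at your scheduler
futures = [client.submit(load, filename) for filename in repo.download_files()]
df = dd.from_delayed(futures)  # from_delayed also accepts distributed futures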