 

Storing dask collection to files/CSV asynchronously

I'm implementing various kinds of data processing pipelines using dask.distributed. Usually the original data is read from S3, and at the end the processed (large) collection is written to CSV on S3 as well.

I can run the processing asynchronously and monitor progress, but I've noticed that all the to_xxx() methods that store collections to file(s) seem to be synchronous calls. One downside is that the call blocks, potentially for a very long time. Another is that I cannot easily construct a complete graph to be executed later.

Is there a way to run e.g. to_csv() asynchronously and get a future object instead of blocking?

PS: I'm pretty sure that I can implement asynchronous storage myself, e.g. by converting the collection to delayed() objects and storing each partition. But this seems like a common case - unless I missed an existing feature, it would be nice to have something like this included in the framework.

asked Feb 05 '26 by evilkonrex

1 Answer

Most to_* functions have a compute=True keyword argument that can be replaced with compute=False. In these cases the call will return a sequence of delayed values that you can then compute asynchronously:

values = df.to_csv('s3://...', compute=False)
futures = client.compute(values)
answered Feb 09 '26 by MRocklin