
Distributed file systems supported by Python/Dask

Which distributed file systems are supported by Dask? Specifically, from which file systems can one read a dask.dataframe? From the Dask documentation I can see that HDFS is certainly supported. Are any other distributed file systems supported, e.g. Ceph?

I could find some discussion about supporting other file systems here: https://github.com/dask/distributed/issues/33, but no final conclusion, except that HDFS is "nastier" than the other options.

Thank you for your help!

asked by S.V

2 Answers

The simplest answer is that if you can mount the filesystem onto every node, i.e., it can be accessed as a local filesystem, then you can use any distributed file system, albeit without any performance optimisation for the original location of any given file chunk.
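A minimal sketch of that approach, assuming a distributed filesystem (CephFS here) is mounted at /mnt/cephfs on every node; the mount point and file pattern are made up for illustration:

    # Once the filesystem is mounted everywhere, Dask treats it as local storage.
    import dask.dataframe as dd

    # Hypothetical glob over CSV files living on the mounted CephFS volume.
    df = dd.read_csv("/mnt/cephfs/data/records-*.csv")
    print(df.head())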

In cases where data location is available from a metadata service (which would be true for Ceph), you could restrict loading tasks to run only on machines where the data is resident. This is not implemented, but it might not be too complicated from the user side; see the sketch below. A similar thing was done in the past for HDFS, but we found that the optimisation did not justify the extra complexity of the code.
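One way a user could approximate that locality restriction is the `workers=` keyword of dask.distributed's Client.submit, which pins a task to particular workers. The scheduler address, worker address, and load_chunk helper below are all hypothetical:

    from distributed import Client

    client = Client("scheduler-host:8786")  # assumed scheduler address

    def load_chunk(path):
        # Hypothetical loader that reads a chunk from the node's local disk.
        with open(path, "rb") as f:
            return f.read()

    # Run the task only on the worker(s) that hold this chunk locally;
    # the worker address is an assumption for illustration.
    future = client.submit(load_chunk, "/data/chunk-0001",
                           workers=["tcp://node-a:45001"])
    data = future.result()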

answered by mdurant

Documentation on which remote file systems are currently supported by Dask, and on how to add support for additional file systems, is available here:

  • http://dask.pydata.org/en/latest/remote-data-services.html
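As a hedged sketch of the protocol-prefix style that page describes, Dask dispatches on the URL scheme to choose a filesystem backend. The bucket and paths below are invented:

    import dask.dataframe as dd

    # Requires the s3fs package; the bucket name is hypothetical.
    df_s3 = dd.read_csv("s3://my-bucket/data/*.csv")

    # Requires an HDFS driver (hdfs3 at the time this was written).
    df_hdfs = dd.read_csv("hdfs:///user/me/data/*.csv")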
answered by MRocklin


