 

How do Dask dataframes handle larger-than-memory datasets?

The documentation of the Dask package for dataframes says:

Dask dataframes look and feel like pandas dataframes, but operate on datasets larger than memory using multiple threads.

But later in the same page:

One dask DataFrame is comprised of several in-memory pandas DataFrames separated along the index.

Does Dask read the different DataFrame partitions from disk sequentially and perform computations that fit into memory? Does it spill some partitions to disk when needed? In general, how does Dask manage the memory <--> disk I/O of data to allow larger-than-memory data analysis?

I tried to perform some basic computations (e.g. mean rating) on the 10M MovieLens dataset and my laptop (8GB RAM) started to swap.

asked Mar 28 '16 by dukebody

2 Answers

Dask.dataframe loads data lazily and attempts to perform your entire computation in one linear scan through the dataset. Surprisingly, this is usually doable.

Intelligently spilling data down to disk is also an option that it can manage, especially when shuffles are required, but generally there are ways around this.
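The single-linear-scan idea can be illustrated without Dask itself. Below is a minimal plain-Python sketch (the partition contents are made up for illustration) of how a reduction like `mean()` only ever needs one partition in memory at a time, because each partition collapses to a small partial aggregate:

```python
# Sketch of the one-linear-scan idea behind a dask.dataframe reduction:
# each partition is reduced to a (sum, count) pair, so only one
# partition needs to be held in memory at any moment.

def partitioned_mean(partitions):
    total, count = 0.0, 0
    for part in partitions:      # one linear scan over all partitions
        total += sum(part)       # per-partition partial aggregate
        count += len(part)
    return total / count

# Toy "partitions" standing in for the in-memory pandas DataFrames:
ratings = [[4.0, 3.5, 5.0], [2.0, 4.5], [3.0, 3.0, 4.0, 1.5]]
print(partitioned_mean(ratings))  # identical to a full in-memory mean
```

The same pattern (map each partition to a small partial result, then combine) is why many pandas-style operations scale to larger-than-memory data without any spilling at all.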

answered Oct 12 '22 by MRocklin


I happened to come back to this page after 2 years, and there is now an easy option to limit the memory usage of each worker. I think it was added by @MRocklin after this thread went inactive.

$ dask-worker tcp://scheduler:port --memory-limit=auto  # total available RAM on the machine
$ dask-worker tcp://scheduler:port --memory-limit=4e9  # four gigabytes per worker process.

This feature is called the spill-to-disk policy for workers, and details can be found in the documentation.

Apparently, extra data will be spilled to the directory given by the --local-directory option:

$ dask-worker tcp://scheduler:port --memory-limit 4e9 --local-directory /scratch 

That data is still available and will be read back from disk when necessary.
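The spill/reload behaviour can be sketched in plain Python. This is not Dask's actual implementation, just a toy store (class name and eviction policy are invented for illustration) showing the idea: when an in-memory limit is hit, a partition is pickled into a local directory and transparently read back when requested:

```python
# Toy sketch of a spill-to-disk policy (NOT Dask's real implementation):
# partitions over an in-memory limit are pickled to a local directory
# and loaded back from disk when they are needed again.
import os
import pickle
import tempfile

class SpillingStore:
    def __init__(self, directory, max_in_memory=2):
        self.directory = directory
        self.max_in_memory = max_in_memory
        self.in_memory = {}   # key -> partition data held in RAM
        self.on_disk = {}     # key -> path of the spilled pickle file

    def put(self, key, partition):
        if len(self.in_memory) >= self.max_in_memory:
            # Evict one resident partition to disk to stay under the limit.
            evict_key, evicted = self.in_memory.popitem()
            path = os.path.join(self.directory, f"{evict_key}.pkl")
            with open(path, "wb") as f:
                pickle.dump(evicted, f)
            self.on_disk[evict_key] = path
        self.in_memory[key] = partition

    def get(self, key):
        if key in self.in_memory:
            return self.in_memory[key]
        # Spilled partitions are still available: read them back from disk.
        with open(self.on_disk[key], "rb") as f:
            return pickle.load(f)

with tempfile.TemporaryDirectory() as scratch:
    store = SpillingStore(scratch, max_in_memory=2)
    for i in range(4):
        store.put(i, list(range(i, i + 3)))
    print(store.get(0))  # available whether resident or spilled
```

Dask's workers do something conceptually similar, keyed off the configured memory limit rather than a partition count.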

answered Oct 12 '22 by Nabin