Is there an equivalent package in R to Python's Dask? Specifically, for running machine learning algorithms on larger-than-memory data sets on a single machine.

Link to Python's Dask page: https://dask.pydata.org/en/latest/
From the Dask website:
Dask natively scales Python
Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love
Dask's schedulers scale to thousand-node clusters and its algorithms have been tested on some of the largest supercomputers in the world.
But you don't need a massive cluster to get started. Dask ships with schedulers designed for use on personal machines. Many people use Dask today to scale computations on their laptop, using multiple cores for computation and their disk for excess storage.
I am developing a simple library called disk.frame that has the potential to take on dask one day. It uses the fst file format and data.table to manipulate large amounts of data on disk. As of now, it doesn't have a cluster module, but given that it uses future in the background, and future can have cluster back-ends, it is a possibility in the future.
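To give a feel for the workflow, here is a minimal sketch, assuming disk.frame and dplyr are installed and a larger-than-memory CSV sits on disk (flights.csv and its columns are placeholders). The function names follow disk.frame's documented interface, so double-check the current docs as the API is still evolving:

    library(disk.frame)
    library(dplyr)

    # spin up several background workers via the future framework
    setup_disk.frame(workers = 4)

    # convert the large CSV into a chunked, fst-backed disk.frame on disk
    flights.df <- csv_to_disk.frame("flights.csv", outdir = "flights.df")

    # dplyr verbs run chunk-by-chunk on disk; collect() pulls only the
    # small aggregated result back into RAM
    delays <- flights.df %>%
      filter(!is.na(dep_delay)) %>%
      group_by(carrier) %>%
      summarise(mean_delay = mean(dep_delay)) %>%
      collect()

Only the per-carrier summary ever has to fit in memory, which is what makes this kind of approach workable for data larger than RAM.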
There is also multidplyr in the works by Hadley and co.
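For what it's worth, multidplyr targets multi-core parallelism on in-memory data rather than out-of-memory data. A hypothetical sketch of its partition/collect workflow, following the API of the later released versions (the flights data frame and its columns are stand-ins, e.g. nycflights13::flights):

    library(dplyr)
    library(multidplyr)

    cluster <- new_cluster(4)                  # 4 local R worker processes

    delays <- flights %>%                      # an ordinary in-memory data frame
      group_by(carrier) %>%
      partition(cluster) %>%                   # send each group to a worker
      summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
      collect()                                # gather the results back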
Currently, I have used disk.frame successfully to manipulate datasets with hundreds of millions of rows and hundreds of columns.
If you are willing to look beyond R, then JuliaDB.jl in the Julia ecosystem is something to look out for.
As a general matter, R, in its native use, operates on data in RAM. Depending on your operating system, when R requires more than the available memory, portions are swapped out to disk. The normal result is thrashing that will bring your machine to a halt. In Windows, you can watch the Task Manager and cry.
There are a few packages that promise to manage this process. RevoScaleR from Microsoft is one; it is not open source and is not available from CRAN. I am as skeptical of software add-ons to R as I am of bolt-on gadgets that promise better fuel economy in your car. There are always trade-offs.
The simple answer is that there is no free lunch in R. A download will not be as effective as some new DIMMs for your machine. You are better off looking at your code first. If that doesn't work, then hire a properly-sized configuration in the cloud.