
R equivalent of Python's dask

Tags: python, r, dask

Is there an equivalent package in R to Python's dask? Specifically for running Machine Learning algorithms on larger-than-memory data sets on a single machine.

Link to Python's Dask page: https://dask.pydata.org/en/latest/

From the Dask website:

Dask natively scales Python

Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love

Dask's schedulers scale to thousand-node clusters and its algorithms have been tested on some of the largest supercomputers in the world.

But you don't need a massive cluster to get started. Dask ships with schedulers designed for use on personal machines. Many people use Dask today to scale computations on their laptop, using multiple cores for computation and their disk for excess storage.

Asked Jun 27 '18 by Adam Bricknell


People also ask

Is Dask the same as Spark?

Summary. Generally Dask is smaller and lighter weight than Spark. This means that it has fewer features and, instead, is used in conjunction with other libraries, particularly those in the numeric Python ecosystem. It couples with libraries like Pandas or Scikit-Learn to achieve high-level functionality.

Is Dask faster than Pandas?

The original pandas query took 182 seconds and the optimized Dask query took 19 seconds, which is about 10 times faster. Dask can provide performance boosts over pandas because it can execute common operations in parallel, where pandas is limited to a single core.

Does Pandas use Dask?

Dask DataFrame is used in situations where pandas is commonly needed, usually when pandas fails due to data size or computation speed: Manipulating large datasets, even when those datasets don't fit in memory. Accelerating long computations by using many cores.

How good is Dask?

Dask is simply the most revolutionary tool for data processing that I have encountered. If you love Pandas and Numpy but were sometimes struggling with data that would not fit into RAM then Dask is definitely what you need.


2 Answers

I am developing a simple library called disk.frame that has the potential to take on dask one day. It uses the fst file format and data.table to manipulate large amounts of data on disk. It doesn't have a cluster module yet, but because it is built on the future package, which supports cluster back-ends, distributed computation may come later.
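A minimal sketch of what a disk.frame workflow looks like (the file and column names are placeholders, and the function names reflect the disk.frame API at the time of writing, so they may change):

    library(disk.frame)
    library(dplyr)

    # Use several local background workers, similar to dask's local scheduler
    setup_disk.frame(workers = 4)

    # Convert a larger-than-memory CSV into a chunked, fst-backed data frame
    # on disk ("big_data.csv" is a hypothetical file)
    big_df <- csv_to_disk.frame("big_data.csv", outdir = "big_data.df")

    # dplyr verbs run chunk by chunk across the workers; collect() brings
    # only the (hopefully small) result back into RAM
    result <- big_df %>%
      filter(value > 0) %>%
      select(id, value) %>%
      collect()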

There is also multidplyr in the works by Hadley and co.
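For reference, a hedged sketch of the multidplyr pattern (function names as of recent versions of the package and may differ; note that multidplyr spreads in-memory data across cores rather than working off disk):

    library(multidplyr)
    library(dplyr)

    # Start four local worker processes
    cluster <- new_cluster(4)

    # Spread the groups of an in-memory data frame across the workers;
    # nycflights13::flights is just a convenient example dataset
    flights_part <- nycflights13::flights %>%
      group_by(dest) %>%
      partition(cluster)

    # dplyr verbs run on each worker; collect() gathers the results
    result <- flights_part %>%
      summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
      collect()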

I have used disk.frame successfully to manipulate datasets with hundreds of millions of rows and hundreds of columns.

If you're willing to look beyond R, then JuliaDB.jl in the Julia ecosystem is something to look out for.

Answered Oct 09 '22 by xiaodai


As a general matter, R, in its native use, operates on data in RAM. Depending on your operating system, when R requires more than the available memory, portions are swapped out to disk. The normal result is thrashing that will bring your machine to a halt. In Windows, you can watch the Task Manager and cry.
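For a rough sense of scale, base R can report how much RAM an object already occupies:

    # How much RAM does an in-memory object take? (base R)
    x <- rnorm(1e7)                       # 10 million doubles
    print(object.size(x), units = "MB")   # roughly 76 MB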

There are a few packages that promise to manage this process. RevoScaleR from Microsoft is one. It is not open source and is not available from CRAN. I am as skeptical of software add-ons to R as I am of bolt-on gadgets that promise better fuel economy for your car. There are always trade-offs.
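For what it's worth, the RevoScaleR workflow looks roughly like the sketch below (proprietary; the rx* function names come from Microsoft's documentation and only work under Microsoft R / Machine Learning Server, and the file names are placeholders):

    library(RevoScaleR)

    # Import a large CSV into the chunked, on-disk .xdf format
    rxImport(inData = "big_data.csv", outFile = "big_data.xdf",
             overwrite = TRUE)

    # Fit a linear model chunk by chunk, without loading everything into RAM
    fit <- rxLinMod(y ~ x1 + x2, data = RxXdfData("big_data.xdf"))
    summary(fit)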

The simple answer is that there is no free lunch in R. A download will not be as effective as some new DIMMs for your machine. You are better off looking at your code first. If that doesn't work, then hire a properly-sized configuration in the cloud.

Answered Oct 09 '22 by Robert Hadow