Is there an equivalent package in R to Python's Dask? Specifically, for running machine learning algorithms on larger-than-memory data sets on a single machine.

Link to Python's Dask page: https://dask.pydata.org/en/latest/
From the Dask website:
Dask natively scales Python
Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love
Dask's schedulers scale to thousand-node clusters and its algorithms have been tested on some of the largest supercomputers in the world.
But you don't need a massive cluster to get started. Dask ships with schedulers designed for use on personal machines. Many people use Dask today to scale computations on their laptop, using multiple cores for computation and their disk for excess storage.
I am developing a simple library called disk.frame that has the potential to take on dask one day. It uses the fst file format and data.table to manipulate large amounts of data on disk. As of now, it doesn't have a cluster module, but given that it uses future in the background, and future can have cluster back-ends, it is a possibility in the future.
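To give a feel for the workflow, here is a minimal sketch, assuming disk.frame and dplyr are installed and a larger-than-memory CSV sits on disk (flights.csv and its columns are placeholders). The function names follow disk.frame's documented interface, so double-check the current docs as the API is still evolving:

    library(disk.frame)
    library(dplyr)

    # spin up several background workers via the future framework
    setup_disk.frame(workers = 4)

    # convert the large CSV into a chunked, fst-backed disk.frame on disk
    flights.df <- csv_to_disk.frame("flights.csv", outdir = "flights.df")

    # dplyr verbs run chunk-by-chunk on disk; collect() pulls only the
    # small aggregated result back into RAM
    delays <- flights.df %>%
      filter(!is.na(dep_delay)) %>%
      group_by(carrier) %>%
      summarise(mean_delay = mean(dep_delay)) %>%
      collect()

Only the per-carrier summary ever has to fit in memory, which is what makes this kind of approach workable for data larger than RAM.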
There is also multidplyr in the works by Hadley and co.
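For what it's worth, multidplyr targets multi-core parallelism on in-memory data rather than out-of-memory data. A hypothetical sketch of its partition/collect workflow, following the API of the later released versions (the flights data frame and its columns are stand-ins, e.g. nycflights13::flights):

    library(dplyr)
    library(multidplyr)

    cluster <- new_cluster(4)                  # 4 local R worker processes

    delays <- flights %>%                      # an ordinary in-memory data frame
      group_by(carrier) %>%
      partition(cluster) %>%                   # send each group to a worker
      summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
      collect()                                # gather the results back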
Currently, I have used disk.frame successfully to manipulate datasets with hundreds of millions of rows and hundreds of columns.
If you are willing to look beyond R, then JuliaDB.jl in the Julia ecosystem is something to look out for.
As a general matter, R, in its native use, operates on data in RAM. Depending on your operating system, when R requires more than the available memory, portions are swapped out to disk. The normal result is thrashing that will bring your machine to a halt. In Windows, you can watch the Task Manager and cry.
There are a few packages that promise to manage this process. RevoScaleR from Microsoft is one; it is not open source and is not available from CRAN. I am as skeptical of software add-ons to R as I am of bolt-on gadgets that promise better fuel economy in your car. There are always trade-offs.
The simple answer is that there is no free lunch in R. A download will not be as effective as some new DIMMs for your machine. You are better off looking at your code first. If that doesn't work, then hire a properly-sized configuration in the cloud.