
Dask Vs Rapids. What does rapids provide which dask doesn't have?

I want to understand the difference between Dask and RAPIDS, and what benefits RAPIDS provides that Dask doesn't have.

Does RAPIDS internally use Dask code? If so, why do we have Dask at all, since Dask can also interact with GPUs?

asked Mar 18 '20 by DjVasu


2 Answers

Dask is a Python library which enables out-of-core parallelism and distribution of some popular Python libraries as well as custom functions.

Take Pandas for example. Pandas is a popular library for working with Dataframes in Python. However, it is single-threaded, and the Dataframes you are working on must fit in memory.

Dask has a subpackage called dask.dataframe which follows most of the same API as Pandas but instead breaks your Dataframe down into partitions which can be operated on in parallel and can be swapped in and out of memory. Dask uses Pandas under the hood, so each partition is a valid Pandas Dataframe.

The overall Dask Dataframe can scale out and use multiple cores or multiple machines.


RAPIDS is a collection of GPU accelerated Python libraries which follow the API of other popular Python packages.

To continue with our Pandas theme, RAPIDS has a package called cuDF, which has much of the same API as Pandas. However cuDF stores Dataframes in GPU memory and uses the GPU to perform computations.

GPUs can accelerate computations, which can lead to performance benefits for your Dataframe operations and enables you to scale up your workflow.


RAPIDS and Dask also work together and Dask is considered a component of RAPIDS because of this. So instead of having a Dask Dataframe made up of individual Pandas Dataframes you could instead have one made up of cuDF Dataframes. This is possible because they follow the same API.

This way you can both scale up by using a GPU and also scale out using multiple GPUs on multiple machines.

answered Oct 13 '22 by Jacob Tomlinson


Dask provides the ability to distribute a job. Dask can scale both horizontally (multiple machines) and vertically (same machine).

RAPIDS provides a set of PyData APIs which are GPU-accelerated: Pandas (cuDF), Scikit-learn (cuML), NumPy (CuPy), etc. This means that you can take the code you already wrote against those APIs, swap in the RAPIDS library, and benefit from GPU acceleration.

When you combine Dask and RAPIDS together, you basically get a framework (Dask) that scales horizontally and vertically, and PyData APIs (RAPIDS) which can leverage underlying GPUs.
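The usual way to wire the two together is `dask-cuda`, which starts one Dask worker per local GPU. A configuration sketch (assuming `dask-cuda` is installed and NVIDIA GPUs are present; not runnable without them):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker per visible GPU on this machine.
cluster = LocalCUDACluster()
client = Client(cluster)

# From here, dask_cudf / cuML work submitted through this client is
# scheduled across the GPU workers; adding machines scales it out.
```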

If you look at broader solutions, Dask can then integrate with orchestration tools like Kubernetes and SLURM to be able to provide even better resource utilization across a large environment.
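For example, with `dask-jobqueue` a Dask cluster can request its workers from a SLURM scheduler. A configuration sketch (the queue name and resource sizes are placeholders, not tested values):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Each Dask worker runs inside a SLURM job with these resources.
cluster = SLURMCluster(cores=8, memory="32GB", queue="gpu")
cluster.scale(jobs=4)     # ask SLURM for 4 worker jobs
client = Client(cluster)  # Dask work now runs on the allocation
```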

answered Oct 13 '22 by Jim Scott