
Comparison of Modin, Dask, Data.table, and pandas for parallel processing and out-of-memory CSV files

What are the fundamental differences and primary use cases for Dask, Modin, and Data.table?

I checked the documentation for each library, and all of them seem to offer a 'similar' solution to pandas' limitations.

Shubham Samant asked Jun 06 '19 19:06




1 Answer

I work with daily stock trading data and came across this post. My data has about 60 million rows and fewer than 10 columns. I tested all three libraries on `read_csv` and a `groupby` mean. Based on this small test, my choice is dask. Below is a comparison of the three:

| library      | `read_csv` time | `groupby` time |
|--------------|-----------------|----------------|
| modin        | 175s            | 150s           |
| dask         | 0s (lazy load)  | 27s            |
| dask persist | 26s             | 1s             |
| datatable    | 8s              | 6s             |

It seems that modin is not as efficient as dask at the moment, at least for my data. `persist` tells dask that the data fits in memory, so dask spends some time loading everything up front instead of loading lazily. datatable holds all the data in memory from the start and is very fast at both the `read_csv` step and the `groupby`. However, given its incompatibility with pandas, dask seems the better choice. I actually come from R and am very familiar with R's data.table, so I have no problem applying its syntax in Python. If datatable in Python connected to pandas as seamlessly as data.table does with data.frame in R, it would have been my choice.
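For anyone who wants to repeat this kind of comparison on their own file, here is a minimal sketch of how the timings could be collected. It is not the exact script behind the table above: the `timed` helper, the `bench_*` function names, the file name `trades.csv`, and the column names `symbol` and `price` are all made up for illustration. The dask part shows why the lazy `read_csv` reports roughly 0s while the work shows up in the `groupby`, and how `persist` moves that cost to load time.

```python
import time


def timed(label, fn, *args, **kwargs):
    """Call fn once, print the wall-clock time, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result, elapsed


def bench_dask(path, key, value):
    # Imported lazily so the timing helper can be used without dask installed.
    import dask.dataframe as dd

    # read_csv only samples the file and builds a task graph, hence the ~0s load.
    ddf, _ = timed("dask read_csv (lazy)", dd.read_csv, path)
    # The CSV is actually parsed here, as part of the groupby computation.
    timed("dask groupby", lambda: ddf.groupby(key)[value].mean().compute())

    # persist() materializes the partitions in memory. With the default local
    # scheduler this call blocks until loading is done.
    ddf_mem, _ = timed("dask persist", ddf.persist)
    # The same groupby now runs on in-memory partitions.
    timed("dask groupby (persisted)",
          lambda: ddf_mem.groupby(key)[value].mean().compute())


def bench_modin(path, key, value):
    import modin.pandas as mpd  # pandas-style API

    df, _ = timed("modin read_csv", mpd.read_csv, path)
    timed("modin groupby", lambda: df.groupby(key)[value].mean())


def bench_datatable(path, key, value):
    import datatable as dt

    frame, _ = timed("datatable fread", dt.fread, path)
    # data.table-style syntax: frame[rows, columns, by]
    timed("datatable groupby",
          lambda: frame[:, dt.mean(dt.f[value]), dt.by(key)])


# Example call (hypothetical file and column names):
# bench_dask("trades.csv", key="symbol", value="price")
```

Each number comes from a single run, so treat the results as rough. Caching by the operating system and the number of cores available can shift them noticeably between runs.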

zxzb answered Sep 21 '22 12:09