What are the fundamental differences and primary use cases for Dask | Modin | Data.table?
I checked the documentation of each library; all of them seem to offer a 'similar' solution to pandas' limitations.
Vaex is a Python alternative to the pandas library that takes less time to run computations on huge data by using out-of-core DataFrames. It has fast, interactive visualization capabilities as well. Pandas remains the most widely used Python library for working with and processing dataframes.
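A minimal sketch of the out-of-core style, assuming a hypothetical HDF5 file and a `price` column (neither is from the post):

```python
import vaex

# Memory-map the file instead of loading it into RAM
# ("big.hdf5" and the "price" column are placeholders).
df = vaex.open("big.hdf5")

# Aggregations stream over the memory-mapped data out of core.
print(df.mean(df.price))
```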
Dask runs faster than pandas for this kind of query, even when the most inefficient column type is used, because it parallelizes the computation. pandas uses only one CPU core to run the query; my computer has 4 cores, and Dask uses all of them.
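A sketch of what such a query looks like in Dask; the file and column names are made up for illustration:

```python
import dask.dataframe as dd

# read_csv splits the file into partitions; the groupby is scheduled
# as one task per partition and runs on all local cores.
ddf = dd.read_csv("trades.csv")
result = ddf.groupby("symbol")["price"].mean().compute()
```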
Modin is another pandas alternative that speeds up operations while keeping the syntax largely the same. It works by utilizing the multiple cores available on a machine (your laptop, for instance) to run pandas operations in parallel.
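In practice only the import changes; a sketch with the same hypothetical file and columns as above:

```python
# Drop-in replacement for pandas: same API, parallel execution
# on a Ray or Dask backend.
import modin.pandas as pd

df = pd.read_csv("trades.csv")
print(df.groupby("symbol")["price"].mean())
```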
Dask is a great way to scale up your pandas code, but naively converting your pandas DataFrame into a Dask DataFrame is not the right way to do it. The fundamental shift should not be to replace pandas with Dask, but to re-use the algorithms, code, and methods you wrote for a single Python process.
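One common way to re-use existing single-process pandas code is `map_partitions`, which applies a plain pandas function to each partition; a sketch, with a hypothetical cleaning function:

```python
import pandas as pd
import dask.dataframe as dd

# A function written and tested against an ordinary pandas DataFrame...
def clean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.dropna(subset=["price"])

# ...re-used unchanged on every partition of the Dask DataFrame.
ddf = dd.read_csv("trades.csv")
ddf = ddf.map_partitions(clean)
```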
I have a task dealing with daily stock trading data and came across this post. My data has about 60 million rows and fewer than 10 columns. I tested all 3 libraries on `read_csv` and a `groupby` mean. Based on this little test, my choice is dask (the lazy-vs-persist distinction is sketched after the table). Below is a comparison of the 3:
| library | `read_csv` time | `groupby` time |
|--------------|-----------------|----------------|
| modin | 175s | 150s |
| dask | 0s (lazy load) | 27s |
| dask persist | 26s | 1s |
| datatable | 8s | 6s |
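The difference between the two dask rows comes down to when the data is loaded. Roughly, under the same placeholder names as above:

```python
import dask.dataframe as dd

# Lazy: read_csv returns immediately; the CSV is scanned during compute().
ddf = dd.read_csv("trades.csv")
ddf.groupby("symbol")["price"].mean().compute()

# Persisted: pay the load cost up front, then groupbys run against RAM.
ddf = dd.read_csv("trades.csv").persist()
ddf.groupby("symbol")["price"].mean().compute()
```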
It seems that modin is not as efficient as dask at the moment, at least for my data. `dask persist` tells dask that your data could fit into memory, so dask takes some time to load everything up front instead of lazily. datatable holds all data in memory from the start and is super fast at both `read_csv` and `groupby`. However, given its incompatibility with pandas, it seems better to use dask. Actually, I came from R and was very familiar with R's data.table, so I have no problem applying its syntax in Python. If datatable in Python could connect seamlessly to pandas (as data.table does with data.frame in R), it would have been my choice.
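For reference, the Python datatable package mirrors R's data.table `DT[i, j, by]` form; a sketch with the same hypothetical names:

```python
import datatable as dt

# fread is a multithreaded reader and holds the frame fully in memory.
DT = dt.fread("trades.csv")

# DT[i, j, by] syntax, as in R's data.table.
res = DT[:, dt.mean(dt.f.price), dt.by("symbol")]

# Interop with pandas exists, but only via an explicit conversion.
pdf = res.to_pandas()
```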