What are the fundamental differences and primary use cases for Dask | Modin | Data.table?
I checked the documentation of each library; all of them seem to offer a 'similar' solution to pandas' limitations.
Vaex is a Python alternative to the pandas library that takes less time to run computations on huge data by using out-of-core DataFrames. It has fast, interactive visualization capabilities as well. Pandas remains the most widely used Python library for working with and processing dataframes.
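A minimal sketch of the out-of-core style, assuming a hypothetical HDF5 file and a `price` column (neither is from the post):

```python
import vaex

# Memory-map the file instead of loading it into RAM
# ("big.hdf5" and the "price" column are placeholders).
df = vaex.open("big.hdf5")

# Aggregations stream over the memory-mapped data out of core.
print(df.mean(df.price))
```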
Dask runs faster than pandas for this kind of query, even when the most inefficient column type is used, because it parallelizes the computation. pandas uses only one CPU core to run the query; my computer has 4 cores, and Dask uses all of them.
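A sketch of what such a query looks like in Dask; the file and column names are made up for illustration:

```python
import dask.dataframe as dd

# read_csv splits the file into partitions; the groupby is scheduled
# as one task per partition and runs on all local cores.
ddf = dd.read_csv("trades.csv")
result = ddf.groupby("symbol")["price"].mean().compute()
```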
Modin is another pandas alternative that speeds up operations while keeping the syntax largely the same. It works by utilizing the multiple cores available on a machine (your laptop, for instance) to run pandas operations in parallel.
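In practice only the import changes; a sketch with the same hypothetical file and columns as above:

```python
# Drop-in replacement for pandas: same API, parallel execution
# on a Ray or Dask backend.
import modin.pandas as pd

df = pd.read_csv("trades.csv")
print(df.groupby("symbol")["price"].mean())
```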
Dask is a great way to scale up your pandas code, but naively converting your pandas DataFrame into a Dask DataFrame is not the right way to do it. The fundamental shift should not be to replace pandas with Dask, but to re-use the algorithms, code, and methods you wrote for a single Python process.
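One common way to re-use existing single-process pandas code is `map_partitions`, which applies a plain pandas function to each partition; a sketch, with a hypothetical cleaning function:

```python
import pandas as pd
import dask.dataframe as dd

# A function written and tested against an ordinary pandas DataFrame...
def clean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.dropna(subset=["price"])

# ...re-used unchanged on every partition of the Dask DataFrame.
ddf = dd.read_csv("trades.csv")
ddf = ddf.map_partitions(clean)
```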
I have a task dealing with daily stock trading data and came across this post. My data has about 60 million rows and fewer than 10 columns. I tested all 3 libraries on `read_csv` and a `groupby` mean. Based on this little test, my choice is dask (the lazy-vs-persist distinction is sketched after the table). Below is a comparison of the 3:
| library | `read_csv` time | `groupby` time |
|--------------|-----------------|----------------|
| modin | 175s | 150s |
| dask | 0s (lazy load) | 27s |
| dask persist | 26s | 1s |
| datatable | 8s | 6s |
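The difference between the two dask rows comes down to when the data is loaded. Roughly, under the same placeholder names as above:

```python
import dask.dataframe as dd

# Lazy: read_csv returns immediately; the CSV is scanned during compute().
ddf = dd.read_csv("trades.csv")
ddf.groupby("symbol")["price"].mean().compute()

# Persisted: pay the load cost up front, then groupbys run against RAM.
ddf = dd.read_csv("trades.csv").persist()
ddf.groupby("symbol")["price"].mean().compute()
```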
It seems that modin is not as efficient as dask at the moment, at least for my data. `dask persist` tells dask that your data could fit into memory, so dask takes some time to load everything up front instead of lazily. datatable holds all data in memory from the start and is super fast at both `read_csv` and `groupby`. However, given its incompatibility with pandas, it seems better to use dask. Actually, I came from R and was very familiar with R's data.table, so I have no problem applying its syntax in Python. If datatable in Python could connect seamlessly to pandas (as data.table does with data.frame in R), it would have been my choice.
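For reference, the Python datatable package mirrors R's data.table `DT[i, j, by]` form; a sketch with the same hypothetical names:

```python
import datatable as dt

# fread is a multithreaded reader and holds the frame fully in memory.
DT = dt.fread("trades.csv")

# DT[i, j, by] syntax, as in R's data.table.
res = DT[:, dt.mean(dt.f.price), dt.by("symbol")]

# Interop with pandas exists, but only via an explicit conversion.
pdf = res.to_pandas()
```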