I'm developing a model for financial purposes. I have all of the S&P 500 components inside a folder, stored as many .hdf files. Each .hdf file has its own multi-index (year, week, minute).
An example of the sequential (non-parallelized) code:
import os
from classAsset import Asset

def model(current_period, previous_period):
    # do stuff on the current period, based on stats derived from previous_period
    return results

if __name__ == '__main__':
    for hdf_file in os.listdir('data_path'):
        asset = Asset(hdf_file)
        for year in asset.data.index.get_level_values(0).unique().values:
            for week in asset.data.loc[year].index.get_level_values(0).unique().values:
                # start and end are defined in another function
                previous_period = asset.data.loc[start:end].Open.values
                current_period = asset.data.loc[year, week].Open.values
                model(current_period, previous_period)
To speed up the process, I'm using multiprocessing.Pool to run the same algorithm on multiple .hdf files at the same time, and I'm quite satisfied with the processing speed (I have a 4c/8t CPU). But now I've discovered Dask.
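Roughly, the multiprocessing version looks something like this (a simplified sketch; the process_file wrapper, the pool size and the 'data_path' literal are just placeholders, and the per-file body is the same loop as in the sequential code above):
import os
from multiprocessing import Pool

from classAsset import Asset

def process_file(hdf_file):
    # run model() over every (year, week) period of one .hdf file
    # model(), start and end are the same as in the sequential version above
    asset = Asset(hdf_file)
    results = []
    for year in asset.data.index.get_level_values(0).unique().values:
        for week in asset.data.loc[year].index.get_level_values(0).unique().values:
            previous_period = asset.data.loc[start:end].Open.values
            current_period = asset.data.loc[year, week].Open.values
            results.append(model(current_period, previous_period))
    return results

if __name__ == '__main__':
    with Pool(processes=8) as pool:  # 4c/8t CPU
        all_results = pool.map(process_file, os.listdir('data_path'))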
In the Dask documentation's 'DataFrame Overview' they indicate:
Trivially parallelizable operations (fast):
Also, in the Dask documentation's 'Use Cases' they indicate:
A programmer has a function that they want to run many times on different inputs. Their function and inputs might use arrays or dataframes internally, but conceptually their problem isn’t a single large array or dataframe.
They want to run these functions in parallel on their laptop while they prototype but they also intend to eventually use an in-house cluster. They wrap their function in dask.delayed and let the appropriate dask scheduler parallelize and load balance the work.
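If I understand that use case correctly, the dask.delayed equivalent of my per-file loop would be something like the sketch below (process_file is the same hypothetical wrapper as in my multiprocessing example, and the scheduler keyword is just one way to ask for a process pool):
import os
import dask
from dask import delayed

if __name__ == '__main__':
    # build one lazy task per .hdf file, then let the Dask scheduler run and load-balance them
    tasks = [delayed(process_file)(hdf_file) for hdf_file in os.listdir('data_path')]
    all_results = dask.compute(*tasks, scheduler='processes')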
So I'm sure I'm missing something, probably more than just one thing. What's the difference between processing many individual pandas DataFrames with multiprocessing.Pool and with dask.multiprocessing?
Do you think I should use Dask for my specific case? Thank you guys.
According to the Dask documentation: "Generally speaking, Dask.dataframe groupby-aggregations are roughly the same performance as pandas groupby-aggregations, just more scalable." In other words, the cost of computing a single aggregation is about the same; what Dask adds over pandas is the ability to scale that work out to a cluster.
dask.bag uses the multiprocessing scheduler by default.
Dask can also be used instead of pandas to merge large data sets: it lets you run the data analytics in parallel, which can make it faster and more memory efficient than pandas. In that kind of comparison, the Dask version uses far less memory than the naive pandas version and finishes fastest (assuming you have CPUs to spare).
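As a rough illustration of that documentation quote, a Dask groupby-aggregation mirrors the pandas one almost line for line (the file name and column names below are made up):
import pandas as pd
import dask.dataframe as dd

# pandas: the whole file must fit in memory and runs on one core
pdf = pd.read_csv('prices.csv')
pandas_mean = pdf.groupby('ticker')['Open'].mean()

# dask: same API, but lazy and partitioned, so it can use many cores or a cluster
ddf = dd.read_csv('prices.csv')
dask_mean = ddf.groupby('ticker')['Open'].mean().compute()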
There is no difference. Dask is doing just what you are doing in your custom code. It uses pandas and a thread or multiprocessing pool for parallelism.
You might prefer Dask for a few reasons: it works out the parallel scheduling and load balancing for you, and the same code can later scale from your laptop to a cluster without rewriting the parallel part.
But if what you have works well for you, then I would just stay with that.
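For what it's worth, choosing between a thread pool and a process pool in Dask is just a keyword at compute time. A tiny self-contained example (the square function is a toy, unrelated to your model):
import dask
from dask import delayed

@delayed
def square(x):
    return x * x

if __name__ == '__main__':
    tasks = [square(i) for i in range(8)]
    print(dask.compute(*tasks, scheduler='threads'))    # run on a thread pool
    print(dask.compute(*tasks, scheduler='processes'))  # run on a process pool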