I'm developing a model for financial purposes. I have all of the S&P 500 components inside a folder, stored as many .hdf files. Each .hdf file has its own multi-index (year, week, minute).
An example of the sequential (non-parallelized) code:
import os
from classAsset import Asset

def model(current_period, previous_period):
    # do stuff on the current period, based on stats derived from previous_period
    return results

if __name__ == '__main__':
    for hdf_file in os.listdir('data_path'):
        asset = Asset(hdf_file)
        for year in asset.data.index.get_level_values(0).unique().values:
            for week in asset.data.loc[year].index.get_level_values(0).unique().values:
                # start and end are defined in another function
                previous_period = asset.data.loc[start:end].Open.values
                current_period = asset.data.loc[year, week].Open.values
                model(current_period, previous_period)
To speed up the process, I'm using multiprocessing.Pool to run the same algorithm on multiple .hdf files at the same time, and I'm quite satisfied with the processing speed (I have a 4c/8t CPU). But now I've discovered Dask.
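Roughly, the multiprocessing version looks something like this (a simplified sketch; the process_file wrapper, the pool size and the 'data_path' literal are just placeholders, and the per-file body is the same loop as in the sequential code above):
import os
from multiprocessing import Pool

from classAsset import Asset

def process_file(hdf_file):
    # run model() over every (year, week) period of one .hdf file
    # model(), start and end are the same as in the sequential version above
    asset = Asset(hdf_file)
    results = []
    for year in asset.data.index.get_level_values(0).unique().values:
        for week in asset.data.loc[year].index.get_level_values(0).unique().values:
            previous_period = asset.data.loc[start:end].Open.values
            current_period = asset.data.loc[year, week].Open.values
            results.append(model(current_period, previous_period))
    return results

if __name__ == '__main__':
    with Pool(processes=8) as pool:  # 4c/8t CPU
        all_results = pool.map(process_file, os.listdir('data_path'))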
In the Dask documentation's 'DataFrame Overview' they indicate:
Trivially parallelizable operations (fast):
Also, in the Dask documentation's 'Use Cases' they indicate:
A programmer has a function that they want to run many times on different inputs. Their function and inputs might use arrays or dataframes internally, but conceptually their problem isn’t a single large array or dataframe.
They want to run these functions in parallel on their laptop while they prototype but they also intend to eventually use an in-house cluster. They wrap their function in dask.delayed and let the appropriate dask scheduler parallelize and load balance the work.
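If I understand that use case correctly, the dask.delayed equivalent of my per-file loop would be something like the sketch below (process_file is the same hypothetical wrapper as in my multiprocessing example, and the scheduler keyword is just one way to ask for a process pool):
import os
import dask
from dask import delayed

if __name__ == '__main__':
    # build one lazy task per .hdf file, then let the Dask scheduler run and load-balance them
    tasks = [delayed(process_file)(hdf_file) for hdf_file in os.listdir('data_path')]
    all_results = dask.compute(*tasks, scheduler='processes')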
So I'm sure I'm missing something, probably more than just one thing. What's the difference between processing many individual pandas DataFrames with multiprocessing.Pool and with dask.multiprocessing?
Do you think I should use Dask for my specific case? Thank you guys.
According to the Dask documentation: "Generally speaking, Dask.dataframe groupby-aggregations are roughly the same performance as pandas groupby-aggregations, just more scalable." In other words, the cost of computing a single aggregation is about the same; what Dask adds over pandas is the ability to scale that work out to a cluster.
dask.bag uses the multiprocessing scheduler by default.
Dask can also be used instead of pandas to merge large data sets: it lets you run the data analytics in parallel, which can make it faster and more memory efficient than pandas. In that kind of comparison, the Dask version uses far less memory than the naive pandas version and finishes fastest (assuming you have CPUs to spare).
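As a rough illustration of that documentation quote, a Dask groupby-aggregation mirrors the pandas one almost line for line (the file name and column names below are made up):
import pandas as pd
import dask.dataframe as dd

# pandas: the whole file must fit in memory and runs on one core
pdf = pd.read_csv('prices.csv')
pandas_mean = pdf.groupby('ticker')['Open'].mean()

# dask: same API, but lazy and partitioned, so it can use many cores or a cluster
ddf = dd.read_csv('prices.csv')
dask_mean = ddf.groupby('ticker')['Open'].mean().compute()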
There is no difference. Dask is doing just what you are doing in your custom code. It uses pandas and a thread or multiprocessing pool for parallelism.
You might prefer Dask for a few reasons: it works out the parallel scheduling and load balancing for you, and the same code can later scale from your laptop to a cluster without rewriting the parallel part.
But if what you have works well for you, then I would just stay with that.
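For what it's worth, choosing between a thread pool and a process pool in Dask is just a keyword at compute time. A tiny self-contained example (the square function is a toy, unrelated to your model):
import dask
from dask import delayed

@delayed
def square(x):
    return x * x

if __name__ == '__main__':
    tasks = [square(i) for i in range(8)]
    print(dask.compute(*tasks, scheduler='threads'))    # run on a thread pool
    print(dask.compute(*tasks, scheduler='processes'))  # run on a process pool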