dask.multiprocessing or pandas + multiprocessing.pool: what's the difference?

I'm developing a model for financial purposes. I have all of the S&P 500 components inside a folder, stored as many .hdf files. Each .hdf file has its own multi-index (year, week, minute).

An example of the sequential (non-parallelized) code:

import os
from classAsset import Asset


def model(current_period, previous_period):
    # do stuff on the current period, based on stats derived from previous_period
    return results

if __name__ == '__main__':
    for hdf_file in os.listdir('data_path'):
        asset = Asset(hdf_file)
        for year in asset.data.index.get_level_values(0).unique().values:
            for week in asset.data.loc[year].index.get_level_values(0).unique().values:

                previous_period = asset.data.loc[(start):(end)].Open.values  # start and end are defined in another function
                current_period = asset.data.loc[year, week].Open.values

                model(current_period, previous_period)

To speed up the process, I'm using multiprocessing.Pool to run the same algorithm on multiple .hdf files at the same time, and I'm quite satisfied with the processing speed (I have a 4c/8t CPU). But now I've discovered Dask.

In Dask documentation 'DataFrame Overview' they indicate:

Trivially parallelizable operations (fast):

  • Elementwise operations: df.x + df.y, df * df
  • Row-wise selections: df[df.x > 0]
  • Loc: df.loc[4.0:10.5] (this is what interests me the most)

Also, in Dask documentation 'Use Cases' they indicate:

A programmer has a function that they want to run many times on different inputs. Their function and inputs might use arrays or dataframes internally, but conceptually their problem isn’t a single large array or dataframe.

They want to run these functions in parallel on their laptop while they prototype but they also intend to eventually use an in-house cluster. They wrap their function in dask.delayed and let the appropriate dask scheduler parallelize and load balance the work.

So I'm sure I'm missing something, or probably more than just something. What's the difference between processing many single pandas dataframes with multiprocessing.pool and dask.multiprocessing?

Do you think I should use Dask for my specific case? Thank you guys.

asked Oct 15 '17 by ilpomo



1 Answer

There is no difference. Dask is doing just what you are doing in your custom code. It uses pandas and a thread or multiprocessing pool for parallelism.

You might prefer Dask for a few reasons:

  1. It would figure out how to write the parallel algorithms automatically
  2. You may want to scale to a cluster in the future

But if what you have works well for you then I would just stay with that.

answered Oct 23 '22 by MRocklin