How to parallelize groupby() in dask?

I tried:

df.groupby('name').agg('count').compute(num_workers=1)
df.groupby('name').agg('count').compute(num_workers=4)

Both calls take the same time. Why does num_workers have no effect?

Thanks

asked Apr 09 '19 19:04

Robin1988

1 Answer

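By default, Dask dataframes use the multi-threaded scheduler, which runs every task inside a single Python process. Threads only run in parallel while the underlying code releases the GIL, so GIL-bound work (for example, handling Python string objects) effectively uses one core, whatever value you pass as num_workers. (Dask is nevertheless useful if your data can't fit in memory.)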

If you want to spread the computation over several processes, you have to use a different scheduler:

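import time

from dask import dataframe as dd
from dask.distributed import LocalCluster, Client

df = dd.read_csv("data.csv")

def group(num_workers):
    # Run the groupby/count and return (result, elapsed seconds)
    start = time.time()
    res = df.groupby("name").agg("count").compute(num_workers=num_workers)
    end = time.time()
    return res, end - start

# First run: default threaded scheduler (single process)
print(group(4))

# Second run: a local cluster of worker processes, set as the default scheduler
clust = LocalCluster()
clt = Client(clust, set_as_default=True)
print(group(4))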

Here, I create a local cluster that uses 4 parallel processes (I have a quad-core machine) and then register a default client, so that subsequent Dask operations run on this cluster. With a two-column CSV file of 1.5 GB, the standard groupby takes around 35 seconds on my laptop, whereas the multi-process one takes only around 22 seconds.

answered Sep 28 '22 17:09

Olivier CAYROL

Tags: pandas, parallel-processing, pandas-groupby, dask
