
Switching from multiprocess to multithreaded Dask.DataFrame

I have a question about using dask to parallelize my code. I have a pandas DataFrame and an 8-core CPU, and I want to apply a function row-wise. Here is an example:

import dask.dataframe as dd
from dask.multiprocessing import get
from geopy.distance import vincenty

# o is a pandas DataFrame; center is a (latitude, longitude) tuple
o['dist_center_from'] = (
    dd.from_pandas(o, npartitions=8)
    .map_partitions(lambda df: df.apply(
        lambda x: vincenty((x.fromlatitude, x.fromlongitude), center).km,
        axis=1))
    .compute(get=get)
)

That code runs on all 8 CPUs simultaneously. The problem is that each worker process uses as much memory as the main process. So I want to run it multithreaded with shared memory instead. I tried changing `from dask.multiprocessing import get` to `from dask.threaded import get`, but then it doesn't use all of my CPUs; I think it runs on a single core.

zhc asked Mar 05 '23 21:03

1 Answer

Yes, this is the tradeoff between threads and processes:

  • Threads: parallelize well only when you run non-Python code (most of the Pandas API on numeric data, other than apply, releases the GIL)
  • Processes: parallelize arbitrary Python code, but require copying data between processes
MRocklin answered Mar 12 '23 22:03