
Switching from multiprocess to multithreaded Dask.DataFrame

I have a question about using dask to parallelize my code. I have a pandas DataFrame and an 8-core CPU, and I want to apply a function row-wise. Here is an example:

import dask.dataframe as dd
from dask.multiprocessing import get
from geopy.distance import vincenty

# o is a pandas DataFrame; center is a (latitude, longitude) tuple
o['dist_center_from'] = (
    dd.from_pandas(o, npartitions=8)
    .map_partitions(lambda df: df.apply(
        lambda x: vincenty((x.fromlatitude, x.fromlongitude), center).km,
        axis=1))
    .compute(get=get)
)

That code runs on all 8 CPUs simultaneously. The problem is that each worker process uses as much memory as the main process. So I want to run it multithreaded with shared memory instead. I tried changing `from dask.multiprocessing import get` to `from dask.threaded import get`, but then it doesn't use all of my CPUs; I think it runs on a single core.

zhc asked Mar 05 '23 21:03

1 Answer

Yes, this is the tradeoff between threads and processes:

  • Threads: parallelize well only when you run non-Python code (most of the Pandas API on numeric data, other than apply, releases the GIL)
  • Processes: parallelize arbitrary Python code, but require copying data between processes
MRocklin answered Mar 12 '23 22:03