I have a question about using dask to parallelize my code. I have a pandas DataFrame and an 8-core CPU, and I want to apply a function row-wise. Here is an example:
import dask.dataframe as dd
from dask.multiprocessing import get
from geopy.distance import vincenty  # geopy < 2.0; newer geopy uses geodesic

# o is a pandas DataFrame; center is a (latitude, longitude) tuple
o['dist_center_from'] = dd.from_pandas(o, npartitions=8).map_partitions(
    lambda df: df.apply(
        lambda x: vincenty((x.fromlatitude, x.fromlongitude), center).km,
        axis=1)).compute(get=get)
That code runs on all 8 cores simultaneously. The problem is that each worker process uses roughly as much memory as the main process. So I want to run it multi-threaded, with shared memory. I tried changing

from dask.multiprocessing import get

to

from dask.threaded import get

but then it doesn't use all of my CPUs; I think it runs on a single core.
Yes, this is the tradeoff between threads and processes: threads share memory with the main process, so nothing has to be copied, but Python's GIL (Global Interpreter Lock) lets only one thread at a time run pure-Python code like your apply/lambda, which is why you see a single busy core. Processes sidestep the GIL and use all cores, but each worker needs its own copy of the data, which is the memory cost you are hitting.
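
Here is a minimal sketch contrasting the two choices, using the scheduler= keyword that replaced get= in newer dask releases (the center tuple and sample frame are hypothetical, just to make it self-contained):

import pandas as pd
import dask.dataframe as dd
from geopy.distance import vincenty  # geopy < 2.0

center = (50.45, 30.52)  # hypothetical reference point
o = pd.DataFrame({'fromlatitude': [50.40, 50.55],
                  'fromlongitude': [30.40, 30.60]})

dist = dd.from_pandas(o, npartitions=8).map_partitions(
    lambda df: df.apply(
        lambda x: vincenty((x.fromlatitude, x.fromlongitude), center).km,
        axis=1))

# Threads share the parent's memory, but the GIL keeps this
# pure-Python apply on roughly one core at a time.
o['dist_center_from'] = dist.compute(scheduler='threads')

# Processes run Python code on all cores, at the cost of
# copying the data into every worker.
o['dist_center_from'] = dist.compute(scheduler='processes')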
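
If shared memory is a hard requirement, a common workaround (not specific to dask, and only a sketch: haversine_km is a hypothetical helper that assumes a spherical Earth, so it approximates vincenty's ellipsoidal result) is to replace the per-row Python calls with vectorized NumPy arithmetic. NumPy releases the GIL during large array operations, and for a computation this simple the vectorized version is usually fast enough without dask at all:

import numpy as np

def haversine_km(lat, lon, center):
    # Vectorized great-circle distance in kilometers.
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2, lon2 = np.radians(center[0]), np.radians(center[1])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

# Operates on whole columns at once; no per-row Python overhead.
o['dist_center_from'] = haversine_km(o.fromlatitude.values,
                                     o.fromlongitude.values, center)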