
Dask: very low CPU usage and multiple threads? Is this expected?

I am using dask as in the question "How to parallelize many (fuzzy) string comparisons using apply in Pandas?"

Basically, I do some computations (without writing anything to disk) that invoke Pandas and Fuzzywuzzy (which apparently may not release the GIL, if that helps), and I run something like:

import dask.multiprocessing
import dask.dataframe as dd

dmaster = dd.from_pandas(master, npartitions=4)
dmaster = dmaster.assign(my_value=dmaster.original.apply(lambda x: helper(x, slave), name='my_value'))
dmaster.compute(get=dask.multiprocessing.get)

However, a variant of the code has been running for 10 hours now and has not finished yet. In the Windows Task Manager I notice that:

  • RAM utilization is pretty low, corresponding to the size of my data
  • CPU usage bounces from 0% up to 5% every 2 to 3 seconds or so
  • I have about 20 Python processes of about 100 MB each, and one 30 GB Python process that likely contains the data (I have a 128 GB machine with an 8-core CPU)

Question is: is that behavior expected? Am I obviously terribly wrong in setting some dask options here?

Of course, I understand the specifics depend on what exactly I am doing, but maybe the patterns above can already tell that something is horribly wrong?

Many thanks!!

ℕʘʘḆḽḘ asked Oct 18 '22 06:10


1 Answer

"Of course, I understand the specifics depend on what exactly I am doing, but maybe the patterns above can already tell that something is horribly wrong?"

This is pretty spot on. Identifying performance issues is tricky, especially when parallel computing comes into play. Here are some things that come to mind.

  1. The multiprocessing scheduler has to move data between different processes every time. The serialization/deserialization cycle could be quite expensive. Using the distributed scheduler would handle this better.
  2. Your function helper could be doing something odd.
  3. Generally, using apply is best avoided, even in Pandas.
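The first point can be checked without Dask. Here is a rough sketch, using only the standard library, that times one pickle round trip of a stand-in for the slave data. That is roughly the kind of work a process-based scheduler has to repeat whenever it ships data to a worker. The name roundtrip_seconds and the synthetic list are assumptions made for the illustration, and Dask's own serialization may differ in detail.

```python
import pickle
import time


def roundtrip_seconds(obj, repeats=5):
    """Return (best seconds, payload bytes) for one pickle/unpickle cycle of obj."""
    best = float("inf")
    size = 0
    for _ in range(repeats):
        start = time.perf_counter()
        payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
        restored = pickle.loads(payload)
        elapsed = time.perf_counter() - start
        if restored != obj:
            raise ValueError("round trip changed the data")
        best = min(best, elapsed)
        size = len(payload)
    return best, size


# Synthetic stand-in for the real `slave` data (hypothetical contents).
slave = [f"company name {i}" for i in range(200_000)]
seconds, size = roundtrip_seconds(slave)
print(f"{size / 1e6:.1f} MB payload, {seconds * 1e3:.1f} ms per round trip")
```

If the round trip for the real data takes a noticeable fraction of the time one task needs, the transfer cost is a plausible suspect. If it is negligible, the problem is more likely in helper itself.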

A good way to pin down problems like these is to create a minimal, complete, verifiable example that others can reproduce and play with easily. Often, in creating such an example, you find the solution to your problem anyway. If that doesn't happen, you can at least pass the buck to the library maintainers. Until such an example exists, most library maintainers won't spend their time on the problem: there are almost always too many details specific to the case at hand to warrant free service.
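As a starting point for such an example, here is a minimal sketch that times a helper on a small synthetic sample, serially and without Dask, then extrapolates to the full row count. Everything in it is an assumption: the real helper is not shown in the question, so difflib.SequenceMatcher from the standard library stands in for Fuzzywuzzy, and the data, the function estimate_total_seconds, and the row count are made up.

```python
import difflib
import time


def helper(value, candidates):
    """Hypothetical stand-in for the real helper: best fuzzy match for value."""
    return max(
        candidates,
        key=lambda c: difflib.SequenceMatcher(None, value, c).ratio(),
    )


def estimate_total_seconds(sample, candidates, total_rows):
    """Time helper on a small sample and extrapolate linearly to total_rows."""
    start = time.perf_counter()
    for value in sample:
        helper(value, candidates)
    per_row = (time.perf_counter() - start) / len(sample)
    return per_row * total_rows


# Synthetic data; sizes and contents are placeholders, not the asker's data.
candidates = [f"acme holdings {i}" for i in range(200)]
sample = [f"acme holding {i}" for i in range(20)]
estimate = estimate_total_seconds(sample, candidates, total_rows=1_000_000)
print(f"estimated serial runtime: {estimate / 3600:.2f} hours")
```

If the serial estimate is already many hours, no scheduler setting will make the job fast and the helper is the thing to optimize. If the estimate is short but the Dask run is not, the overhead is coming from how the work is distributed.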

MRocklin answered Oct 22 '22 11:10