
Dask job killed because of memory usage?

Hi, I have a Python script that uses the dask library to handle a very large data frame, larger than physical memory. I notice that the job gets killed in the middle of a run if the computer's memory usage stays at 100% for some time.

Is this expected? I would have thought the data would be spilled to disk, and there is plenty of disk space left.

Is there a way to limit its total memory usage? Thanks

EDIT:

I also tried:

dask.set_options(available_memory=12e9)

It did not work. It did not seem to limit the memory usage. Again, when memory usage reaches 100%, the job gets killed.

asked Oct 24 '25 by Bo Qiang


1 Answer

The line

 ddf = ddf.set_index("sort_col").compute()

is actually pulling the whole dataframe into memory and converting it to pandas. You want to remove the .compute(), apply whatever logic you want first (filtering, groupby/aggregations, etc.), and only then call compute to produce a result that is small enough, as in the sketch below.
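A minimal sketch of that pattern, assuming a hypothetical CSV input and column names ("value", "category") that are not in the original question:

    import dask.dataframe as dd

    # Hypothetical file pattern and column names, for illustration only.
    ddf = dd.read_csv("big_data_*.csv")

    # Keep everything lazy: set the index and apply the reducing logic
    # (filtering, groupby/aggregation) without calling .compute() yet.
    ddf = ddf.set_index("sort_col")
    filtered = ddf[ddf["value"] > 0]
    result = filtered.groupby("category")["value"].mean()

    # Only the small aggregated result is pulled into memory as pandas.
    result = result.compute()
    print(result)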

The important thing to remember is that the resulting output must fit into memory, and each chunk being processed by each worker (plus overheads) also needs to fit into memory.
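As a hedged illustration of keeping each chunk small (the block size and file pattern below are assumptions, not taken from the question), the per-chunk size can be capped when the dataframe is first read:

    import dask.dataframe as dd

    # A smaller block size means smaller per-worker chunks; 64MB is an
    # arbitrary example value and "big_data_*.csv" a hypothetical path.
    ddf = dd.read_csv("big_data_*.csv", blocksize="64MB")
    print(ddf.npartitions)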

answered Oct 26 '25 by mdurant