Hi, I have a Python script that uses the Dask library to handle a very large dataframe, larger than physical memory. I notice that the job gets killed in the middle of a run if memory usage stays at 100% of the machine for some time.
Is this expected? I would have thought the data would be spilled to disk, and there is plenty of disk space left.
Is there a way to limit its total memory usage? Thanks.
EDIT:
I also tried:
dask.set_options(available_memory=12e9)
It did not work. It did not seem to limit its memory usage. Again, when memory usage reaches 100%, the job gets killed.
The line
ddf = ddf.set_index("sort_col").compute()
is actually pulling the whole dataframe into memory and converting it to pandas. You want to remove the .compute(), apply whatever logic you want first (filtering, groupby/aggregations, etc.), and only then call compute to produce a result that is small enough.
The important thing to remember is that the resulting output must fit into memory, and each chunk being processed by each worker (plus overheads) also needs to fit into memory.
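For illustration, here is a minimal sketch of that pattern. The CSV path and the column names (value, sort_col) are placeholders, not taken from your script; the point is that the filter and groupby stay lazy and only the small aggregated result is materialized:

```python
import dask.dataframe as dd

# Placeholder input: swap in however you currently build ddf.
ddf = dd.read_csv("large_data-*.csv")

# Keep everything lazy: filter and aggregate before materializing.
filtered = ddf[ddf["value"] > 0]
summary = filtered.groupby("sort_col")["value"].mean()

# Only the small aggregated result is pulled into memory as pandas.
result = summary.compute()
print(result.head())
```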