Hi, I have a Python script that uses the Dask library to handle a very large dataframe, larger than physical memory. I notice that the job gets killed in the middle of a run if memory usage stays at 100% of the machine for some time.
Is this expected? I would have thought the data would be spilled to disk, and there is plenty of disk space left.
Is there a way to limit its total memory usage? Thanks.
EDIT:
I also tried:
dask.set_options(available_memory=12e9)
It did not work. It did not seem to limit its memory usage. Again, when memory usage reaches 100%, the job gets killed.
The line
ddf = ddf.set_index("sort_col").compute()
is actually pulling the whole dataframe into memory and converting it to pandas. You want to remove the .compute(), apply whatever logic you want first (filtering, groupby/aggregations, etc.), and only then call compute to produce a result that is small enough.
The important thing to remember is that the resulting output must fit into memory, and each chunk being processed by each worker (plus overheads) also needs to fit into memory.
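For illustration, here is a minimal sketch of that pattern. The CSV path and the column names (value, sort_col) are placeholders, not taken from your script; the point is that the filter and groupby stay lazy and only the small aggregated result is materialized:

```python
import dask.dataframe as dd

# Placeholder input: swap in however you currently build ddf.
ddf = dd.read_csv("large_data-*.csv")

# Keep everything lazy: filter and aggregate before materializing.
filtered = ddf[ddf["value"] > 0]
summary = filtered.groupby("sort_col")["value"].mean()

# Only the small aggregated result is pulled into memory as pandas.
result = summary.compute()
print(result.head())
```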