
R unable to process heavy tasks for many hours

I have a list of ~90 zipped files. I have written a loop that unzips each one (roughly 1 GB once unzipped), does some computations, saves the output for that file, and deletes the unzipped copy. One iteration takes around 30-60 minutes per file (the files are not all exactly the same size).

I am not too worried about the time, as I can leave it working over the weekend. However, R doesn't manage to get all the way through. I left it on Friday night, but it only ran for 12 hours, so it only processed 30 of the 90 files.

I don't deal with this type of heavy process often, but the same thing has happened in the past with similar jobs. Is there any command I should add to my loop to stop the computer from freezing during such intensive processing? I tried gc() at the end of the loop, to no avail.

Is there a list of "good practice" recommendations for this kind of procedure?

asked Feb 04 '19 by user3507584
1 Answer

If your session is freezing, you are likely running into a problem you need to isolate: it may be a single problematic file, or you may be running out of memory and making heavy use of swap.

Regardless, here are some tips or ideas you could implement:

  • Write your code so it processes one file as a single, self-contained case, e.g. a function like process_gz_folder(). Then loop over the file paths and invoke that function for each one; this keeps the global environment clean (a sketch follows after this list).
  • As you have already tried, gc() can sometimes help, but it depends on the situation and on whether memory can actually be released (for example after running rm()). It could be called after invoking the function from the first point.
  • Are you keeping the results of each file in memory? Does this set of results grow with each iteration? If so, it may be consuming memory you need; writing each result to disk in a suitable format lets you accumulate results on disk rather than in RAM.
  • To add to the prior point, if the files produce outputs, make sure their names are informative, and consider including a timestamp (e.g. inputfile_results_YYYYMMDD).
  • The code could check whether a file has already been processed and skip to the next one; this saves you from restarting from scratch after a crash, especially if your check is simply the existence of an output file (with timestamp!).
  • Use try() so that a failure on one file does not stop later iterations; it should, however, produce a warning or some output so you can come back to the failure at a later point.
  • A more general approach is to create a single script that processes a single file. It could just wrap the function from the first point, preceded by setTimeLimit() with a time after which processing of that file is abandoned. Iterate over the files with a shell script that calls this R script via Rscript, passing arguments (file paths, for example). This approach may help avoid freezes, but it depends on you knowing and setting an acceptable time limit (a second sketch follows after this list).
  • Determine whether the files are too large for memory; if so, the code may need to be made more memory-efficient, or changed to process the data incrementally so it does not run out of memory.
  • Reduce other tasks on the computer that consume resources and could contribute to a freeze.
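
To make the first few points concrete, here is a minimal sketch tying them together. It assumes .zip archives (for plain .gz files you would swap in R.utils::gunzip() or similar), and the paths, the out_dir argument and the lapply() step are illustrative placeholders rather than your actual computation:

    process_gz_folder <- function(zip_path, out_dir) {
      base <- tools::file_path_sans_ext(basename(zip_path))
      out_file <- file.path(out_dir,
                            paste0(base, "_results_", format(Sys.Date(), "%Y%m%d"), ".rds"))

      # Skip files that already have an output on disk (whatever its date stamp)
      if (any(startsWith(list.files(out_dir), paste0(base, "_results_")))) {
        message("Skipping (already processed): ", zip_path)
        return(invisible(NULL))
      }

      # Unzip into a temporary directory that is always deleted afterwards
      tmp_dir <- tempfile("unzipped_")
      dir.create(tmp_dir)
      on.exit(unlink(tmp_dir, recursive = TRUE), add = TRUE)
      unzip(zip_path, exdir = tmp_dir)
      extracted <- list.files(tmp_dir, full.names = TRUE)

      # Placeholder for the real computation on the extracted file(s)
      result <- lapply(extracted, function(f) file.info(f)$size)

      # Write the result to disk instead of accumulating it in memory
      saveRDS(result, out_file)
      invisible(NULL)
    }

    out_dir <- "path/to/results"
    dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
    zip_files <- list.files("path/to/zips", pattern = "\\.zip$", full.names = TRUE)

    for (zp in zip_files) {
      # try() keeps one bad file from stopping the whole run; failures are reported
      res <- try(process_gz_folder(zp, out_dir), silent = TRUE)
      if (inherits(res, "try-error")) {
        message("FAILED: ", zp, " -- ", conditionMessage(attr(res, "condition")))
      }
      gc()  # optional; only helps once large objects are out of scope
    }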
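
For the per-file script with a time limit, one possible shape is the sketch below; the file name process_one.R, the two-hour limit and the output path are arbitrary examples rather than fixed choices:

    # process_one.R
    # Run from a shell as:  Rscript process_one.R path/to/file001.zip

    source("process_gz_folder.R")  # or define process_gz_folder() here

    args <- commandArgs(trailingOnly = TRUE)
    zip_path <- args[1]

    # Give up on this file after ~2 hours of elapsed time; the limit is checked
    # at interrupt points, so long uninterruptible C calls may overrun it slightly
    setTimeLimit(elapsed = 2 * 60 * 60)

    process_gz_folder(zip_path, out_dir = "path/to/results")

A shell loop such as for f in path/to/zips/*.zip; do Rscript process_one.R "$f"; done then gives every file its own fresh R session, so a crashed or timed-out file cannot take the whole run down, and the skip-if-already-processed check above makes the loop safe to rerun.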

These are just some ideas that spring to mind and could be worth considering in your case (given the info provided). It would help to see some code and to understand what kind of processing you are doing on each file.

answered Sep 24 '22 by zacdav