
Parallel processing and temporary files

I'm using the mclapply function from the multicore package for parallel processing. All of the child processes it starts appear to get the same temporary file names from the tempfile function. For example, with four processors,

library(multicore)
mclapply(1:4, function(x) tempfile())

returns four identical filenames. Obviously I need the temporary files to be different so that the child processes don't overwrite each other's files. When tempfile is used indirectly, i.e. when I call some function that itself calls tempfile, I have no control over the filename.

Is there a way around this? Do other parallel processing packages for R (e.g. foreach) have the same problem?

Update: This is no longer an issue since R 2.14.1.

CHANGES IN R VERSION 2.14.0 patched:

[...]

o tempfile() on a Unix-alike now takes the process ID into account.
  This is needed with multicore (and as part of parallel) because
  the parent and all the children share a session temporary
  directory, and they can share the C random number stream used to
  produce the unique part.  Further, two children can call
  tempfile() simultaneously.
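
A quick way to check whether a given installation is affected is to repeat the experiment and look for duplicates. The sketch below assumes the parallel package (which ships with R since 2.14.0 and incorporates multicore) and falls back to a single core on Windows, where forking is not available; the variable names are illustrative only.

```r
library(parallel)  # assumption: used here in place of the older multicore package

# Forking needs a Unix-alike; mclapply only accepts mc.cores = 1 on Windows
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Ask each task for a temp file name, exactly as in the question
tmp_names <- unlist(mclapply(1:4, function(x) tempfile(), mc.cores = cores))

# anyDuplicated() returns 0 when every name is distinct
n_dup <- anyDuplicated(tmp_names)
```

If n_dup is non-zero, the installation still hands out colliding names and one of the workarounds in the answers is needed.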

asked Mar 10 '11 16:03 by otsaw


3 Answers

I believe multicore spins off a separate process for each subtask. If that assumption is correct, then you should be able to use Sys.getpid() to "seed" tempfile:

tempfile(pattern=paste("foo", Sys.getpid(), sep=""))
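
As a rough sketch of how this could be wired into mclapply: the helper name pid_tempfile is hypothetical, and the parallel package is assumed as a stand-in for multicore. Tasks that land in the same child share a PID, but successive tempfile calls within one process already differ in their random part.

```r
library(parallel)  # assumption: used here in place of the older multicore package

# Hypothetical helper: embed the calling process's PID in the file name
pid_tempfile <- function(prefix = "foo") {
  tempfile(pattern = paste0(prefix, Sys.getpid(), "-"))
}

# Forking needs a Unix-alike; mclapply only accepts mc.cores = 1 on Windows
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each forked child reports its own PID, so names from different children differ
files <- unlist(mclapply(1:4, function(i) pid_tempfile(), mc.cores = cores))
```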

answered Oct 18 '22 11:10 by Daniel Dickison


Use the x in your function:

mclapply(1:4, function(x) tempfile(pattern=paste("file",x,"-",sep="")))

answered Oct 18 '22 11:10 by Aaron left Stack Overflow


Because the parallel jobs all run at the same time, and because the random seed comes from the system time, running four instances of tempfile in parallel will typically produce the same results, at least if you have four cores. With only two cores, you'll get two pairs of identical temp file names.

Better to generate the tempfile names first and give them to your function as an argument:

filenames <- tempfile( rep("file",4) )
mclapply( filenames, function(x){})
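
To make the pattern concrete, here is a minimal sketch in which each worker writes to the file it was handed and the parent reads the results back. The file contents and variable names are made up for illustration, and the parallel package is assumed as a stand-in for multicore.

```r
library(parallel)  # assumption: used here in place of the older multicore package

# Generate all names in the parent, before forking, so they do not depend
# on the random state inside the children
filenames <- tempfile(rep("file", 4))

# Forking needs a Unix-alike; mclapply only accepts mc.cores = 1 on Windows
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each worker writes only to the file whose name it was given
written <- mclapply(seq_along(filenames), function(i) {
  writeLines(paste("result", i), filenames[i])
  filenames[i]
}, mc.cores = cores)

# Back in the parent: read one line per file, then clean up
contents <- vapply(filenames, readLines, character(1), USE.NAMES = FALSE)
unlink(filenames)
```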

If you're using someone else's function that has a tempfile call in it, then working the PID into the tempfile name by modifying the tempfile function, as previously suggested, is probably the simplest plan:

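tempfile <- function(pattern = "file", tmpdir = tempdir(), fileext = "") {
  # Prefix the pattern with the process ID so each child gets distinct names;
  # delegate to base::tempfile rather than calling .Internal directly
  base::tempfile(paste("pid", Sys.getpid(), pattern, sep = ""), tmpdir, fileext)
}
mclapply(1:4, function(x) tempfile())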

answered Oct 18 '22 10:10 by Tim