parallelize process in missForest package

Question

I am using the missForest package to impute the missing values in my data set. My question is: how can this process be parallelized to shorten the time it takes to get the results? Please refer to this example (from the missForest package):

library(missForest)
data(iris)
summary(iris)

The data contains four continuous and one categorical variable. Artificially produce missing values using the prodNA function:

set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)
summary(iris.mis)

Impute the missing values, providing the complete matrix for illustration. Use 'verbose' to see what happens between iterations:

iris.imp <- missForest(iris.mis, xtrue = iris, verbose = TRUE)

Daniel Stekhoven · Accepted Answer

Yesterday I submitted version 1.4 of missForest to CRAN; the Windows and Linux packages are ready, the Mac version will follow soon.

The new function has an additional argument "parallelize", which lets you either compute each single forest in parallel (parallelize="forests") or compute several forests on multiple variables at the same time (parallelize="variables"). The default setting is no parallel computing (parallelize="no").

Do not forget to register a suitable parallel backend, e.g. using the package "doParallel", before trying it for the first time. The "doParallel" vignette gives an illustrative example in Section 4.
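As a minimal sketch of the above, assuming the "doParallel" backend and two available cores (the core count is an arbitrary choice for illustration), the question's example could be run like this:

```r
library(missForest)
library(doParallel)

# Register a parallel backend first; 2 cores is an assumption, adjust to your machine
registerDoParallel(cores = 2)

data(iris)
set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)

# Grow each single forest in parallel
iris.imp <- missForest(iris.mis, parallelize = "forests")

# Alternatively, build the forests for several variables at the same time
iris.imp.v <- missForest(iris.mis, parallelize = "variables")

# Release the workers created implicitly by registerDoParallel()
stopImplicitCluster()
```

Which of the two settings is faster depends on the data, so it may be worth timing both with system.time() on your own data set.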

Due to some other details I had to temporarily remove the "missForest" vignette from the package. But I will resolve this in due course and release it as version 1.4-1.

Steve Weston · Answer

It's a bit tricky to do a good job of parallelizing the missForest function. There seem to be two basic ways to do it:

1. Create the randomForest model objects in parallel;
2. Execute multiple randomForest operations (create model and predict) in parallel for each of the columns of the data frame that contain NAs.

Method 1 is rather easy to implement, except that you have to compute the error estimates yourself, since the randomForest combine function doesn't compute them for you. However, if the individual randomForest objects don't take long to compute and there are many columns containing NAs, you may get little or no speedup, even though the operations in aggregate take a long time to compute.
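To illustrate method 1 on its own, here is a rough sketch that grows one forest as several sub-forests in parallel and merges them with combine. It uses foreach with the doParallel backend and the iris data as assumptions for illustration; it is not the code used inside missForest.

```r
library(randomForest)
library(foreach)
library(doParallel)

registerDoParallel(cores = 2)  # 2 cores is an assumption

x <- iris[, -5]
y <- iris$Species

# Grow 4 sub-forests of 100 trees each and merge them into one forest
rf <- foreach(ntree = rep(100, 4), .combine = randomForest::combine,
              .packages = "randomForest") %dopar% {
  randomForest(x, y, ntree = ntree)
}

rf$ntree              # total number of trees in the merged forest
is.null(rf$err.rate)  # per the combine documentation, error estimates are dropped

stopImplicitCluster()
```

Because the merged object carries no error estimates, anything like an out-of-bag error has to be computed separately, which is the extra work mentioned above.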

Method 2 is a bit harder to implement because the sequential algorithm updates the columns of the xmis data frame after each randomForest operation. I think the right way to parallelize this is to process n columns in parallel at a time (where n is the number of worker processes), thus requiring another loop around the n columns in order to process all of the columns of the data frame. My experiments suggest that unless this is done, the outer loop takes longer to converge, thus losing the benefit of executing in parallel.
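The batching idea for method 2 can be sketched as follows. This is a simplified single sweep over numeric columns only, with a mean-based initial guess and hypothetical variable names (x_mis, x_imp, na_loc); the real algorithm repeats such sweeps until a stopping criterion is met, and this sketch does not reproduce the package's implementation.

```r
library(randomForest)
library(foreach)
library(doParallel)

registerDoParallel(cores = 2)      # 2 cores is an assumption
n_workers <- getDoParWorkers()

# Toy data: numeric iris columns with 15 values knocked out per column
set.seed(81)
x_mis <- iris[, 1:4]
for (j in seq_along(x_mis)) x_mis[sample(nrow(x_mis), 15), j] <- NA
na_loc <- is.na(x_mis)

# Initial guess: fill each column's NAs with the column mean
x_imp <- x_mis
for (j in seq_along(x_imp)) {
  x_imp[na_loc[, j], j] <- mean(x_mis[[j]], na.rm = TRUE)
}

# Split the columns that contain NAs into batches of n_workers columns
vars <- which(colSums(na_loc) > 0)
batches <- split(vars, ceiling(seq_along(vars) / n_workers))

for (batch in batches) {
  # Within a batch, fit and predict for each column in parallel
  preds <- foreach(j = batch, .packages = "randomForest") %dopar% {
    obs <- !na_loc[, j]
    fit <- randomForest(x = x_imp[obs, -j, drop = FALSE],
                        y = x_imp[obs, j], ntree = 100)
    predict(fit, x_imp[!obs, -j, drop = FALSE])
  }
  # Update the data frame before the next batch, so later batches
  # see the freshly imputed values
  for (k in seq_along(batch)) {
    x_imp[na_loc[, batch[k]], batch[k]] <- preds[[k]]
  }
}

stopImplicitCluster()
```

Updating x_imp between batches keeps the behaviour closer to the sequential algorithm than processing all columns at once against stale values.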

In general, to get a performance improvement you will need to implement both of these methods and choose between them based on your input data. If you have just a few columns with NAs but the randomForest models take a long time to compute, you should choose method 1. If you have many columns with NAs, you should probably choose method 2, even if the individual randomForest models take a long time to compute, because it parallelizes more efficiently. It is still possible that method 2 will require an extra iteration of the outer while loop.

In the process of experimenting with missForest, I eventually developed a parallel version of the package. I put the modified version of library.R on GitHub Gist, but it isn't trivial to use in that form, especially without documentation. So I contacted the author of missForest, and he is very interested in incorporating at least some of my modifications into the official package. Hopefully the next version of missForest posted to CRAN will support parallel execution.

Tags: r, parallel-processing

hema
