I am using a package called missForest to estimate the missing values in my data set. My question is: how can we parallelize this process to shorten the time that it takes to get the results? Please refer to this example (from missForest package):
data(iris)
summary(iris)
The data contains four continuous and one categorical variable.
Artificially produce missing values using the prodNA function:
set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)
summary(iris.mis)
Impute missing values providing the complete matrix for illustration. Use ’verbose’ to see what happens between iterations:
iris.imp <- missForest(iris.mis, xtrue = iris, verbose = TRUE)
Yesterday I submitted version 1.4 of missForest to CRAN; the Windows and Linux packages are ready, the Mac version will follow soon.
The new function has an additional argument "parallelize" which allows to either compute the single forests in a parallel fashion (parallelize="forests") or to compute several forests on multiple variables at the same time (parallelize="variables"). The default setting is without parallel computing (parallelize="no").
Do not forget to register a suitable parallel backend, e.g. using the package "doParallel", before trying it for the first time. The "doParallel" vignette gives an illustrative example in Section 4.
Due to some other details I had to temporarily remove the "missForest" vignette from the package. But I will resolve this in due course and release it as version 1.4-1.
It's a bit tricky to do a good job of parallelizing the missForest function. There seem to be two basic ways to do it:
Method 1 is rather easy to implement, except that you have to compute the error estimates yourself since the randomForest combine function doesn't compute them for you. However, if the randomForest objects don't take that long to compute and there are many columns containing NA's, you may get very little if any speed up, even though the operations in aggregate take a long time to compute.
Method 2 is a bit harder to implement because the sequential algorithm updates the columns of the xmis data frame after each randomForest operation. I think the right way to parallelize this is to process n columns in parallel at a time (where n is the number of worker processes), thus requiring another loop around the n columns in order to process all of the columns of the data frame. My experiments suggest that unless this is done, the outer loop takes longer to converge, thus losing the benefit of executing in parallel.
In general, to get a performance improvement you will need to implement both of these methods, and choose which to use based on your input data. If you just have a few columns with NA's but the randomForest models take a long time to compute, you should choose method 1. If you have many columns with NA's, you should probably choose method 2, even if the individual randomForest models take a long time to compute because this can be done more efficiently, although it's possible that it will still require an extra iteration of the outer while loop.
In the process of experimenting with missForest, I eventually developed a parallel version of the package. I put the modified version of library.R on GitHub Gist, however it isn't trivial to use in that form, especially without documentation. So I contacted the author of missForest, and he is very interested in incorporating at least some of my modifications into the official package, so hopefully the next version of missForest that is posted to CRAN will support parallel execution.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With