Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: how to use long vectors with randomForest?

One of the new features of R 3.0.0 was the introduction of long vectors. However, .C() and .Fortran() do not accept long vector inputs. On R-bloggers I find:

This is a precaution as it is very unlikely that existing code will have been written to handle long vectors (and the R wrappers often assume that length(x) is an integer)

I work with R-package randomForest and this package obviously needs .Fortran() since it crashes leaving the error message

Error in randomForest.default: long vectors (argument 20) are not supported in .Fortran

How to overcome this problem? I use randomForest 4.6-7 (built under R 3.0.2) on a Windows 7 64bit computer.

like image 690
user7417 Avatar asked Mar 17 '14 12:03

user7417


1 Answers

The only way to guarantee that your input data frame will be accepted by randomForest is to ensure that the vectors inside the data frame do not have length which exceeds 2^31 – 1 (i.e are not long). If you must start off with a data frame containing long vectors, then you would have the subset the data frame to achieve an acceptable dimension for the vectors. Here is one way you could subset a data frame to make it suitable for randomForest:

# given data frame 'df' with long vectors
maxDim <- 2^31 - 1;
df[1:maxDim, ]

However, there is a major problem with doing this which is that you would be throwing away all observations (i.e. features) appearing in rows 2^31 or higher. In practice, you probably do not need so many observations to run a random forest calculation. The easy workaround to your problem is to simply take a statistically valid sub sample of the original dataset with a size which does not exceed 2^31 - 1. Store the data using R vectors not of the long type, and your randomForest calculation should run without any issues.

like image 191
Tim Biegeleisen Avatar answered Nov 15 '22 21:11

Tim Biegeleisen