
Minimum number of observations when performing Random Forest

Is it possible to apply random forests to very small datasets? I have a dataset with many variables but only 25 observations each. Random forests produce reasonable results, with low out-of-bag (OOB) errors (10-25%). Is there any rule of thumb regarding the minimum number of observations to use? In addition, one of the response variables is unbalanced, and if I subsample it I will end up with an even smaller number of observations. Thanks in advance.

Oritteropus asked Jul 09 '13 09:07




1 Answer

Absolutely, RF can be used on this type of dataset (i.e. p > n). In fact, RF is used in fields like genomics, where the number of variables can be >= 20000 and there are only a very small number of rows, say 10-12. There, the whole task is figuring out which of the 20k variables make up a parsimonious marker (i.e. feature selection is the entire problem).
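As a rough illustration of the feature-selection use case, here is a minimal sketch that ranks variables by impurity-based importance on a synthetic p > n dataset. It assumes scikit-learn and NumPy are available; the data, the choice of 12 rows by 2000 columns, and the "first five columns carry signal" setup are all made up for the example, not taken from any real genomics study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a p > n problem: 12 rows, 2000 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 2000))
y = np.array([0, 1] * 6)

# Hypothetical signal: shift the first five columns for class 1.
X[y == 1, :5] += 2.0

forest = RandomForestClassifier(n_estimators=500, random_state=1)
forest.fit(X, y)

# Rank variables by impurity-based importance and keep the top 10 candidates.
importances = forest.feature_importances_
top_k = np.argsort(importances)[::-1][:10]

for idx in top_k:
    print(f"variable {idx}: importance {importances[idx]:.4f}")
```

With so few rows the ranking is noisy, so it is worth treating the top variables as candidates to validate rather than as a final marker set.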

I don't have any rules of thumb about minimum size, other than this: if your model doesn't work well on a held-out sample (leave-one-out cross-validation might work well in your case, given how few observations you have), then you should try something else.
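A minimal sketch of that check, assuming scikit-learn and NumPy: it runs leave-one-out cross-validation on a synthetic 25-row dataset and compares the result with the OOB estimate from a single fit. The data and the two-column signal are invented for illustration; substitute your own X and y.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in for a small dataset: 25 observations, 200 variables.
rng = np.random.default_rng(0)
n_samples, n_features = 25, 200
X = rng.normal(size=(n_samples, n_features))

# Hypothetical signal: the class depends only on the first two columns.
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)

# Leave-one-out: each observation is predicted by a forest
# trained on the other 24 observations.
loo_pred = cross_val_predict(model, X, y, cv=LeaveOneOut())
loo_accuracy = float(np.mean(loo_pred == y))

# OOB estimate from a single fit on all rows, for comparison.
model.fit(X, y)
oob_accuracy = float(model.oob_score_)

print(f"LOO accuracy: {loo_accuracy:.2f}")
print(f"OOB accuracy: {oob_accuracy:.2f}")
```

If the two estimates disagree badly, or both are near chance, that is a sign to try something else. For an unbalanced response, looking at per-class accuracy on `loo_pred` is more informative than the overall figure.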

Hope this helps

Wake2Sleep answered Oct 21 '22 13:10