Random forests in R (empty classes in y and argument length 0)

I'm working with random forests for the first time and have run into some problems I can't figure out. When I run the analysis on my whole dataset (about 3000 rows) I don't get any error message, but when I run the same analysis on a subset of it (about 300 rows) I get an error:

dataset <- read.csv("datasetNA.csv", sep=";", header=T)
names (dataset)
dataset2 <- dataset[complete.cases(dataset$response),]
library(randomForest)
dataset2 <- na.roughfix(dataset2)
data.rforest <- randomForest(dataset2$response ~ dataset2$predictorA + dataset2$predictorB+ dataset2$predictorC + dataset2$predictorD + dataset2$predictorE + dataset2$predictorF + dataset2$predictorG + dataset2$predictorH + dataset2$predictorI, data=dataset2, ntree=100, keep.forest=FALSE, importance=TRUE)

# subset of my original dataset:
groupA<-dataset2[dataset2$order=="groupA",]
data.rforest <- randomForest(groupA$response ~ groupA$predictorA + groupA$predictorB+ groupA$predictorC + groupA$predictorD + groupA$predictorE + groupA$predictorF + groupA$predictorG + groupA$predictorH + groupA$predictorI, data=groupA, ntree=100, keep.forest=FALSE, importance=TRUE)

Error in randomForest.default(m, y, ...) : Can't have empty classes in y.

However, my response variable doesn't have any empty classes.

If I instead call randomForest as (a+b+c, y) rather than (y ~ a+b+c), I get a different message:

Error in if (n == 0) stop("data (x) has 0 rows") : 
  argument length zero
Warning messages:
1: In Ops.factor(groupA$responseA + groupA$responseB,  :
  + not meaningful for factors

The second problem is that when I try to impute my data through rfImpute() I get an error:

Error in na.roughfix.default(x) :  roughfix can only deal with numeric data

However, all my columns are either factors or numeric.

Can somebody see where I'm going wrong?

asked Nov 21 '12 14:11

user1842218

2 Answers

Based on the discussion in the comments, here's a guess at a potential solution.

The confusion here arises from the fact that the levels of a factor are an attribute of the variable. Those levels remain the same no matter what subset of the data you take, however small. So after subsetting, your response can still carry levels that no longer occur in any row, and randomForest reports those as empty classes. This is a feature, not a bug, but it is a common source of confusion.

If you want to drop missing levels when subsetting, wrap your subset operation in droplevels():

groupA <- droplevels(dataset2[dataset2$order=="groupA",])
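A minimal sketch of the effect, using a small made-up data frame. The column names `order` and `response` mirror the question, but the values and the name `groupA_raw` are invented for illustration:

```r
# Toy data: values are invented; only the column names follow the question.
dataset2 <- data.frame(
  order    = factor(c("groupA", "groupA", "groupA", "groupB", "groupB")),
  response = factor(c("yes", "no", "yes", "maybe", "maybe"))
)

# Plain subsetting keeps every level of the original factor,
# including "maybe", which no longer occurs in the subset.
groupA_raw <- dataset2[dataset2$order == "groupA", ]
levels(groupA_raw$response)
table(groupA_raw$response)   # "maybe" has a count of 0: an empty class

# droplevels() removes the levels that have no observations left,
# in every factor column of the data frame.
groupA <- droplevels(groupA_raw)
levels(groupA$response)
```

Checking `table()` on the response of a subset before fitting is a quick way to see whether any class has a count of zero.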

I should probably also add that many R users set options(stringsAsFactors = FALSE) when starting a new session (e.g. in their .Rprofile file) to avoid these kinds of hassles. The downside to doing this is that if you share your code with other people frequently, this can cause problems if they haven't altered R's default options.
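One way to sidestep the sharing problem, sketched here as a suggestion rather than part of the original advice, is to pass `stringsAsFactors` explicitly in the call that reads the data, so the result does not depend on anyone's session options (in R 4.0.0 and later the default is already `FALSE`). The inline CSV text below is invented and stands in for a real file:

```r
# Inline CSV text stands in for a real file; the contents are invented.
csv_text <- "order;response\ngroupA;yes\ngroupB;no\n"

# Passing stringsAsFactors explicitly makes the result independent of the
# session's global options.
as_chr <- read.csv(text = csv_text, sep = ";", stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, sep = ";", stringsAsFactors = TRUE)

class(as_chr$order)   # "character"
class(as_fct$order)   # "factor"
```

With character columns, you would then convert only the columns that need it, after subsetting, for example with `factor(groupA$response)`, so that the levels reflect the subset.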

answered Sep 18 '22 16:09

joran

When subsetting leaves some factor levels without any observations, you must reset the levels:

levels(train11$str)
[1] "B" "D" "E" "G" "H" "I" "O" "T" "X" "Y" "b"
train11$str <- factor(train11$str)
levels(train11$str)
[1] "B" "D" "E" "G" "H" "I" "O" "T" "b"
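The same idea on a self-contained vector, as a rough sketch. The name `str_col` and its values are invented; the point is only that calling `factor()` on an existing factor rebuilds it from the values that are actually present, which gives the same levels as `droplevels()` here:

```r
# Invented example: "X" and "Y" are declared as levels but never occur.
str_col <- factor(c("B", "D", "B"), levels = c("B", "D", "X", "Y"))
levels(str_col)

releveled <- factor(str_col)       # rebuild the factor from its observed values
dropped   <- droplevels(str_col)   # dedicated helper for dropping unused levels

levels(releveled)
identical(levels(releveled), levels(dropped))
```

For a single column either call works; `droplevels()` is handier when a whole data frame needs cleaning in one step.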

answered Sep 20 '22 16:09

Robert Williams
