
randomForest in R: can fit a model and use it for predictions without error, but tuneRF gives a length-mismatch error

Just messing around with UCI heart disease data: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data. Data is of the format:

A tibble: 6 x 14
    age   sex    cp trestbps  chol   fbs restecg thalach exang oldpeak
  <dbl> <dbl> <dbl>    <int> <int> <dbl>   <int>   <int> <int>   <dbl>
1    63     1     3      145   233     1       0     150     0     2.3
2    41     0     1      130   204     0       0     172     0     1.4

Growing/fitting the tree on the training set works great, as does using it for predictions on the test set. However, tuneRF gives the error:

Error in randomForest.default(x, y, mtry = mtryStart, ntree = ntreeTry,  : 
  length of response must be the same as predictors 

It's R 3.5.0 and randomForest 4.6-14.

Some notes you'll see in the code:

1) the tuneRF command is using subsets of the same dataset, so the class labels are the same

2) the "target" response variable has been converted to factor before training/test partitioning

I have a feeling it is related to the way I am subsetting, that the results are lists instead of dataframes, maybe? But I used the same approach for the earlier steps without error. I found an SO question regarding this before, but can't find it in my history/google now. Even if I could find it, I don't understand how it applies, since I used the same method of subsetting before without any problem.

Script:

library(tidyverse)
library(randomForest)

I've added the Hungarian data, after imputing the missing values (I don't want to use the response for imputation), by running:

hungar_heart <- cbind(impute(hungar_heart[, -14]), hungar_heart[, 14])

I then set the column names on hungar_heart and append it to the Cleveland data:

hungar_heart <- setNames(hungar_heart, c("age","sex","cp","trestbps","chol","fbs","restecg","thalach","exang","oldpeak","slope","ca","thal","target"))
heart_total <- rbind(heart_data, hungar_heart)

heart_total$target <- as.factor(heart_total$target)

#Partition new combined dataset into training and test sets after setting seed (123)
set.seed(123)
indicator <- sample(2, nrow(heart_total), replace = TRUE, prob = c(.7,.3))
train <- heart_total[indicator==1,]
test <- heart_total[indicator==2,]

#Fit random forest to training set, using default values to start.  
forest <- randomForest(target~., data=train)

#Use trained model on test set
predict_try <- predict(forest, test)

#so far so good.  now tuneRF gives error:

tune_RF <- tuneRF(train[,-14],train[,14],
   stepFactor = 0.5,
   plot = TRUE,
   ntreeTry = 300,
   improve = 0.05)

Error in randomForest.default(x, y, mtry = mtryStart, ntree = ntreeTry,  : 
length of response must be the same as predictors
In addition: Warning message:
In randomForest.default(x, y, mtry = mtryStart, ntree = ntreeTry,  :
  The response has five or fewer unique values.  Are you sure you want to do regression?

#FWIW, length:

length(train[,-14])
[1] 13

length(train[,14])
[1] 1

I think it's probably just some quirk of my subsetting method that I didn't expect.

Thanks

userninenineninenine asked Oct 20 '25 01:10


1 Answer

Great - figured this out thanks to some help.

I should have explicitly included in my OP that I was using dplyr.

It turns out that although randomForest and predict on that random forest work fine on tibbles, tuneRF (or at least tuneRF with the way I subsetted, train[,14]) expects a plain data frame and throws an error otherwise.
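A likely explanation, consistent with the length() output in the question: on a tibble, [ , j] never drops a single column down to a vector, so train[,14] is a one-column tibble whose length() is 1 (the number of columns). Because that object is not a factor, it would also account for the regression warning. Here is a minimal sketch with made-up data; toy_df and toy_tbl are hypothetical names, not part of the original script:

```r
library(tibble)

# Hypothetical toy data; not the heart-disease set
toy_df  <- data.frame(x = 1:4, target = factor(c(0, 1, 0, 1)))
toy_tbl <- as_tibble(toy_df)

# data.frame: single-column [ , j] drops to a plain vector (here, a factor)
df_col <- toy_df[, 2]
is.factor(df_col)   # TRUE
length(df_col)      # 4, one entry per row

# tibble: [ , j] always returns a tibble, so length() counts columns
tbl_col <- toy_tbl[, 2]
is.factor(tbl_col)  # FALSE
length(tbl_col)     # 1
```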

Very simple fix: add this before the tuneRF line:

train <- as.data.frame(train)
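An alternative that should also work, if you would rather keep the tibble, is to pull the response out with [[ ]] (or $), which returns the column itself as a factor vector. The sketch below uses the built-in iris data as a stand-in, and the tuning values are illustrative rather than the ones from the question:

```r
library(tibble)
library(randomForest)

# Stand-in data: iris as a tibble, with Species (column 5) as the response
iris_tbl <- as_tibble(iris)

# [[ ]] extracts the column as a factor vector, even from a tibble
y <- iris_tbl[[5]]

set.seed(123)
tuned <- tuneRF(iris_tbl[, -5], y,
                stepFactor = 2,   # illustrative value, not the one from the question
                ntreeTry = 300,
                improve = 0.05,
                plot = FALSE)
```

In the script from the question, the equivalent change would presumably be passing train[[14]] (or train$target) as the second argument to tuneRF.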

userninenineninenine answered Oct 22 '25 03:10