Just messing around with UCI heart disease data: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data. Data is of the format:
A tibble: 6 x 14
    age   sex    cp trestbps  chol   fbs restecg thalach exang oldpeak
  <dbl> <dbl> <dbl>    <int> <int> <dbl>   <int>   <int> <int>   <dbl>
1    63     1     3      145   233     1       0     150     0     2.3
2    41     0     1      130   204     0       0     172     0     1.4
Growing/fitting the tree on the training set works great, as does using it for predictions on the test set. However, tuneRF gives the error:
Error in randomForest.default(x, y, mtry = mtryStart, ntree = ntreeTry, :
length of response must be the same as predictors
It's R 3.5.0 and randomForest 4.6-14.
Some notes you'll see in the code:
1) the tuneRF command is using subsets of the same dataset, so the class labels are the same
2) the "target" response variable has been converted to factor before training/test partitioning
I suspect it is related to the way I am subsetting: maybe the results are lists instead of data frames? But I used the same approach for the earlier steps without error. I found an SO question about this before, but can't find it in my history or on Google now. Even if I could find it, I don't see how it would apply, since I used the same method of subsetting earlier without any problem.
Script:
library(tidyverse)
library(randomForest)
I've added the Hungarian data after imputing its missing values (I don't want to use the response for imputation) by running:
hungar_heart <- cbind(impute(hungar_heart[, -14]), hungar_heart[, 14])
I then set the column names on hungar_heart and append it to the Cleveland data:
hungar_heart <- setNames(hungar_heart, c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"))
heart_total <- rbind(heart_data, hungar_heart)
heart_total$target <- as.factor(heart_total$target)
#Partition new combined dataset into training and test sets after setting seed (123)
set.seed(123)
indicator <- sample(2, nrow(heart_total), replace = TRUE, prob = c(.7,.3))
train <- heart_total[indicator==1,]
test <- heart_total[indicator==2,]
#Fit random forest to training set, using default values to start.
forest <- randomForest(target~., data=train)
#Use trained model on test set
predict_try <- predict(forest, test)
#so far so good. now tuneRF gives error:
tune_RF <- tuneRF(train[,-14], train[,14],
                  stepFactor = 0.5,
                  plot = TRUE,
                  ntreeTry = 300,
                  improve = 0.05)
Error in randomForest.default(x, y, mtry = mtryStart, ntree = ntreeTry, :
length of response must be the same as predictors
In addition: Warning message:
In randomForest.default(x, y, mtry = mtryStart, ntree = ntreeTry, :
The response has five or fewer unique values. Are you sure you want to do regression?
#FWIW, length:
length(train[,-14])
[1] 13
length(train[,14])
[1] 1
I think it's probably just some behaviour I didn't expect from my subsetting method.
Thanks
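Great - figured this out thanks to some help.
I should have explicitly mentioned in my OP that I was using dplyr, so train was a tibble rather than a plain data frame.
It turns out that although randomForest with a formula, and predict on that forest, work fine on tibbles, the tuneRF call as I wrote it does not. On a tibble, train[,14] returns a one-column tibble instead of a factor vector. That is why length(train[,14]) is 1, and why the response is not recognised as a factor and triggers the regression warning. After converting to a data frame, train[,14] is a plain factor again.
Very simple fix, placed before the tuneRF line:
train <- as.data.frame(train)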
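To see why the conversion matters, here is a minimal sketch that uses the built-in iris data as a stand-in for the heart data (that substitution, and the names tbl, df and tuned, are assumptions for illustration only). It compares single-column subsetting on a tibble and on a data frame, then runs tuneRF with a response that really is a factor vector.

```r
library(tibble)
library(randomForest)

# iris stands in for the heart data; column 5 (Species) plays the role of "target"
tbl <- as_tibble(iris)
df  <- as.data.frame(tbl)

# On a tibble, [ , j] keeps the tibble structure: one column, so length is 1
length(tbl[, 5])      # 1
is.factor(tbl[, 5])   # FALSE

# On a data frame, [ , j] drops to the underlying vector
length(df[, 5])       # 150
is.factor(df[, 5])    # TRUE

# [[ ]] extracts the vector from a tibble without converting it
is.factor(tbl[[5]])   # TRUE

# tuneRF now receives a factor response of the same length as the predictors
set.seed(123)
tuned <- tuneRF(df[, -5], df[, 5],
                stepFactor = 0.5,
                ntreeTry = 300,
                improve = 0.05,
                plot = FALSE,
                trace = FALSE)
```

If you would rather keep train as a tibble, passing train[[14]] or train$target as the response should have the same effect as the as.data.frame conversion, since both return the factor vector itself.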