I have a data set on which I use the model.matrix() function to convert factor variables to dummy variables. My data has 10 columns like this, each with 3 levels (2, 3, 4), and I've been creating dummy variables for each of them separately.
xFormData <- function(dataset) {
  mm0 <- model.matrix(~ factor(dataset$type), data = dataset)
  mm1 <- model.matrix(~ factor(dataset$type_last1), data = dataset)
  mm2 <- model.matrix(~ factor(dataset$type_last2), data = dataset)
  mm3 <- model.matrix(~ factor(dataset$type_last3), data = dataset)
  mm4 <- model.matrix(~ factor(dataset$type_last4), data = dataset)
  mm5 <- model.matrix(~ factor(dataset$type_last5), data = dataset)
  mm6 <- model.matrix(~ factor(dataset$type_last6), data = dataset)
  mm7 <- model.matrix(~ factor(dataset$type_last7), data = dataset)
  mm8 <- model.matrix(~ factor(dataset$type_last8), data = dataset)
  mm9 <- model.matrix(~ factor(dataset$type_last9), data = dataset)
  mm10 <- model.matrix(~ factor(dataset$type_last10), data = dataset)
  dataset <- cbind(dataset, mm0, mm1, mm2, mm3, mm4, mm5, mm6, mm7, mm8, mm9, mm10)
  dataset
}
I'm wondering if this is the wrong procedure: after running a randomForest on the data and plotting the variable importance, the plot showed the dummy variable columns individually. For example, if columns 61-63 are the 3 dummy variables for column 10, the randomForest sees column 62 by itself as an important predictor.
I have 2 questions:
1) Is this ok?
2) If not, how can I group the dummy variables so that the rf knows they are together?
This is OK, and it is what many modeling functions (lm and glm, for example) do behind the scenes anyway if you leave the factors as factors. Different levels of a factor are different features for most machine learning purposes. Think of a random example like test outcome ~ school: maybe going to school A is very predictive of whether you pass or fail the test, but going to school B or school C is not. Then the school A feature would be useful, but the others would not.
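To make the school example concrete, here is a minimal sketch in base R. The data frame `toy`, its school names, and its pass/fail values are made up for illustration; only model.matrix() comes from the question.

```r
# Hypothetical toy data: school names and pass/fail outcomes are invented.
toy <- data.frame(
  school = factor(c("A", "A", "B", "C", "B", "A")),
  passed = c(1, 1, 0, 0, 1, 1)
)

# "- 1" drops the intercept so every level gets its own 0/1 column.
dummies <- model.matrix(~ school - 1, data = toy)

# One column per level: "schoolA" "schoolB" "schoolC"
colnames(dummies)
```

Each of those columns is a separate feature as far as the model is concerned, so a variable importance plot can rank schoolA highly while schoolB and schoolC sit near the bottom.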
This is covered in one of the caret vignette documents: http://cran.r-project.org/web/packages/caret/vignettes/caretMisc.pdf
Also, the cars data set included with caret should be a useful example. It contains 2 factors, "manufacturer" and "car type", that have been dummy-coded into a series of numeric features for machine learning purposes.
data(cars, package='caret')
head(cars)
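If you want to dummy-code your own columns the same way, one possible approach is to build all of them from a single formula instead of one model.matrix() call per column. This is only a sketch under assumptions: the small `dataset` below is invented, and it uses three of the question's column names to stay short.

```r
# Hypothetical data standing in for the question's columns; values are made up.
dataset <- data.frame(
  type       = c(2, 3, 4, 2, 3, 4),
  type_last1 = c(3, 3, 2, 4, 4, 2),
  type_last2 = c(4, 2, 2, 3, 4, 3)
)

type_cols <- c("type", "type_last1", "type_last2")

# Convert the listed columns to factors in one step.
dataset[type_cols] <- lapply(dataset[type_cols], factor)

# reformulate() builds the formula ~ type + type_last1 + type_last2.
# The first column of the result is "(Intercept)", so drop it.
mm <- model.matrix(reformulate(type_cols), data = dataset)[, -1]

colnames(mm)
```

With R's default treatment contrasts, the first level of each factor (here 2) is the reference level, so each column contributes two dummy columns rather than three. Dropping the intercept once also avoids the repeated "(Intercept)" columns you get when binding several separate model matrices together.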