
R model.matrix setup

Tags:

r

I have a data set on which I use the model.matrix() function to convert factor variables to dummy variables. My data has 10 columns like this, each with 3 levels (2, 3, 4), and I've been creating dummy variables for each of them separately.

xFormData <- function(dataset){
    mm0 <- model.matrix(~ factor(dataset$type), data = dataset)
    mm1 <- model.matrix(~ factor(dataset$type_last1), data = dataset)
    mm2 <- model.matrix(~ factor(dataset$type_last2), data = dataset)
    mm3 <- model.matrix(~ factor(dataset$type_last3), data = dataset)
    mm4 <- model.matrix(~ factor(dataset$type_last4), data = dataset)
    mm5 <- model.matrix(~ factor(dataset$type_last5), data = dataset)
    mm6 <- model.matrix(~ factor(dataset$type_last6), data = dataset)
    mm7 <- model.matrix(~ factor(dataset$type_last7), data = dataset)
    mm8 <- model.matrix(~ factor(dataset$type_last8), data = dataset)
    mm9 <- model.matrix(~ factor(dataset$type_last9), data = dataset)
    mm10 <- model.matrix(~ factor(dataset$type_last10), data = dataset)

    dataset <- cbind(dataset, mm0, mm1, mm2, mm3, mm4, mm5, mm6, mm7, mm8, mm9, mm10)

    dataset
}

I'm wondering if this is the wrong procedure: after running a randomForest on the data and plotting the variable importance, the plot showed the dummy variable columns individually. Say columns 61-63 are the 3 dummy variables for column 10; the randomForest sees column 62 by itself as an important predictor.

I have 2 questions:

1) Is this OK?

2) If not, how can I group the dummy variables so that the random forest knows they belong together?

screechOwl asked Oct 24 '22 09:10


1 Answer

This is OK, and it is what happens behind the scenes anyway if you leave the factors as factors. For most machine learning purposes, the different levels of a factor are different features. Think of an example like test outcome ~ school: maybe going to school A is very predictive of whether you pass or fail the test, but going to school B or school C is not. Then the school A feature would be useful, but the others would not.
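To make the school example concrete, here is a minimal sketch in base R. The data frame and its values are invented purely for illustration; only model.matrix() comes from the question. It shows that each level of a factor ends up as its own indicator column, which is why a model can rate one level as important and ignore the others.

```r
# Toy data, invented for illustration: one factor with three levels.
d <- data.frame(
  school = factor(c("A", "B", "C", "A", "B", "C")),
  pass   = c(1, 0, 0, 1, 1, 0)
)

# "- 1" drops the intercept so that every level gets its own indicator column.
mm <- model.matrix(~ school - 1, data = d)

colnames(mm)  # one column per level: schoolA, schoolB, schoolC
rowSums(mm)   # each row has exactly one indicator set to 1
```

Without the "- 1", the default treatment contrasts would keep an "(Intercept)" column and drop the first level as the reference, so the columns would be schoolB and schoolC only.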

This is covered in one of the caret vignette documents: http://cran.r-project.org/web/packages/caret/vignettes/caretMisc.pdf

Also, the cars data set included with caret should be a useful example. It contains 2 factors - "manufacturer" and "car type" - that have been dummy-coded into a series of numeric features for machine learning purposes.

data(cars, package='caret')
head(cars)
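
If you want the same kind of dummy coding for your own columns without writing one model.matrix() call per column, something like the following might work. This is only a sketch under assumptions: the toy data is made up, the cols argument is my own addition, and I am assuming every column takes the values 2, 3 and 4 as described in the question.

```r
# Toy data, invented for illustration; the column names mirror the question.
set.seed(1)
dataset <- data.frame(
  type       = sample(2:4, 20, replace = TRUE),
  type_last1 = sample(2:4, 20, replace = TRUE),
  type_last2 = sample(2:4, 20, replace = TRUE)
)

xFormData <- function(dataset, cols) {
  # Fix the levels explicitly so every column yields the same set of dummies,
  # even if a level happens to be absent from the data.
  dataset[cols] <- lapply(dataset[cols], factor, levels = 2:4)
  # Build one formula over all columns and drop the single "(Intercept)" column.
  mm <- model.matrix(reformulate(cols), data = dataset)[, -1, drop = FALSE]
  cbind(dataset, mm)
}

out <- xFormData(dataset, c("type", "type_last1", "type_last2"))
colnames(out)
```

Note that with the default treatment contrasts the first level (2) is the reference, so each column contributes two dummy columns (for levels 3 and 4) rather than three.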
John Colby answered Oct 27 '22 11:10