
R - Model with a lot of dummy variables

If I have a column in a data set that takes several different values, how would I go about creating dummy variables for it?

Example: Let's say that I have a column named color that has Red, Green, Yellow, Blue, Pink, and Grey as options for the color of a car.

What is the best way to turn this column into a factor, without creating a bunch of dummy variables by hand?

Edit: So I did what Greg recommended and this is what I have. I was wondering about the NA output though and was unsure why it is there.

 > data$Trim<-factor(data$Trim)
 > data$Model<-factor(data$Model)
 > data$Type<-factor(data$Type)
 > data=cbind(Price,Mileage,Buick,Cadillac,Chevrolet,Pontiac,SAAB,Saturn,Model,Trim,Type,Cylinder,Liter,Doors,Cruise,Sound,Leather)
 > fit <- lm( Price ~ Mileage+Buick+Cadillac+Chevrolet+Pontiac+SAAB+Saturn+Model+Trim+Type+Cylinder+Liter+Doors+Cruise+Sound+Leather, x=TRUE )
 > summary(fit)

Then I get a message "Coefficients: (21 not defined because of singularities)" and for some of the variables the output is NA.
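For what it's worth, that message generally means some columns of the design matrix are exact linear combinations of others, so lm cannot estimate a separate coefficient for them and reports NA. A minimal sketch of one way this can happen, using the built-in mtcars data rather than the data above; the `eight` column is a hypothetical hand-made dummy that duplicates information already carried by the factor:

```r
mycars <- mtcars
mycars$cyl <- factor(mycars$cyl)

# Hypothetical hand-made dummy: identical to the cyl8 column
# that lm builds automatically from the factor
mycars$eight <- as.numeric(mycars$cyl == "8")

fit <- lm(mpg ~ wt + cyl + eight, data = mycars)

coef(fit)   # the coefficient for `eight` is NA
alias(fit)  # lists which terms are linearly dependent on others
```

If something similar is going on with the make dummies and the Model factor, dropping the redundant hand-made columns and keeping only the factor would be one thing to try.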

John asked Dec 21 '22 13:12


1 Answer

R will create dummy variables for you automatically when a factor appears in a model formula. Here is a basic example:

> mycars <- mtcars
> mycars$cyl <- factor(mycars$cyl)
> fit <- lm( mpg ~ wt+cyl, data=mycars, x=TRUE )
> summary(fit)

Call:
lm(formula = mpg ~ wt + cyl, data = mycars, x = TRUE)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5890 -1.2357 -0.5159  1.3845  5.7915 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  33.9908     1.8878  18.006  < 2e-16 ***
wt           -3.2056     0.7539  -4.252 0.000213 ***
cyl6         -4.2556     1.3861  -3.070 0.004718 ** 
cyl8         -6.0709     1.6523  -3.674 0.000999 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 2.557 on 28 degrees of freedom
Multiple R-squared: 0.8374,     Adjusted R-squared:  0.82 
F-statistic: 48.08 on 3 and 28 DF,  p-value: 3.594e-11 

> head(fit$x)
                  (Intercept)    wt cyl6 cyl8
Mazda RX4                   1 2.620    1    0
Mazda RX4 Wag               1 2.875    1    0
Datsun 710                  1 2.320    0    0
Hornet 4 Drive              1 3.215    1    0
Hornet Sportabout           1 3.440    0    1
Valiant                     1 3.460    1    0
> 

The x=TRUE argument in the call to lm tells it to return the x (design) matrix actually used, which includes the dummy variables. If you don't need to look at the created dummy variables, you can leave it out. See ?contrasts for more detail if you want to control how the dummy variables are created.
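As a rough sketch of those two points (not part of the example above): model.matrix() shows the dummy columns without fitting a model, and relevel() is one way to choose which level serves as the baseline. The `cars` data frame and its `price` values are made up for illustration, mirroring the color example from the question.

```r
# Hypothetical data: one row per color from the question
cars <- data.frame(
  price = c(10, 12, 9, 15, 11, 14),
  color = c("Red", "Green", "Yellow", "Blue", "Pink", "Grey")
)
cars$color <- factor(cars$color)

# Build the design matrix directly; with the default treatment
# contrasts the first level (alphabetically, "Blue") is the baseline
X <- model.matrix(~ color, data = cars)
colnames(X)

# Make "Red" the baseline instead, then rebuild the matrix
cars$color <- relevel(cars$color, ref = "Red")
X_red <- model.matrix(~ color, data = cars)
colnames(X_red)
```

A factor with six levels yields five dummy columns plus the intercept; the baseline level is the one without its own column.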

Greg Snow answered Jan 13 '23 13:01