If I have a column in a data set that takes several categorical values, how would I go about creating dummy variables for it?
Example: Let's say that I have a column named color, with Red, Green, Yellow, Blue, Pink, and Grey as options for the color of a car.
What is the best way to turn this variable into a factor without creating a bunch of dummy variables by hand?
Edit: So I did what Greg recommended and this is what I have. I was wondering about the NA output though and was unsure why it is there.
 > data$Trim<-factor(data$Trim)
 > data$Model<-factor(data$Model)
 > data$Type<-factor(data$Type)
 > data=cbind(Price,Mileage,Buick,Cadillac,Chevrolet,Pontiac,SAAB,Saturn,Model,Trim,Type,Cylinder,Liter,Doors,Cruise,Sound,Leather)
 > fit <- lm( Price ~ Mileage+Buick+Cadillac+Chevrolet+Pontiac+SAAB+Saturn+Model+Trim+Type+Cylinder+Liter+Doors+Cruise+Sound+Leather, x=TRUE )
 > summary(fit)
Then I get the message "Coefficients: (21 not defined because of singularities)", and for some of the variables the estimates in the output are NA.
R will create dummy variables for you automatically whenever a factor appears in a model formula. Here is a basic example:
> mycars <- mtcars
> mycars$cyl <- factor(mycars$cyl)
> fit <- lm( mpg ~ wt+cyl, data=mycars, x=TRUE )
> summary(fit)
Call:
lm(formula = mpg ~ wt + cyl, data = mycars, x = TRUE)
Residuals:
    Min      1Q  Median      3Q     Max 
-4.5890 -1.2357 -0.5159  1.3845  5.7915 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  33.9908     1.8878  18.006  < 2e-16 ***
wt           -3.2056     0.7539  -4.252 0.000213 ***
cyl6         -4.2556     1.3861  -3.070 0.004718 ** 
cyl8         -6.0709     1.6523  -3.674 0.000999 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Residual standard error: 2.557 on 28 degrees of freedom
Multiple R-squared: 0.8374,     Adjusted R-squared:  0.82 
F-statistic: 48.08 on 3 and 28 DF,  p-value: 3.594e-11 
> head(fit$x)
                  (Intercept)    wt cyl6 cyl8
Mazda RX4                   1 2.620    1    0
Mazda RX4 Wag               1 2.875    1    0
Datsun 710                  1 2.320    0    0
Hornet 4 Drive              1 3.215    1    0
Hornet Sportabout           1 3.440    0    1
Valiant                     1 3.460    1    0
> 
The x=TRUE in the call to lm tells it to return the x matrix (the design matrix) actually used, which includes the dummy variables. Here cyl has the levels 4, 6, and 8; the first level, 4, is the baseline, so only cyl6 and cyl8 get columns. If you don't want to look at the created dummy variables, you can leave x=TRUE out. See ?contrasts for more detail if you want to change how the dummy variables are created.
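To connect this to the color example from the question, here is a minimal sketch. The data frame cars, its price values, and the column is_blue are all made up for illustration. It uses model.matrix to show the dummy columns R builds for a factor, relevel to pick a different baseline, and then adds a hand-made dummy that duplicates information already in the factor. That redundancy is one common way to get an NA coefficient and the "not defined because of singularities" message. Whether it is the cause in the question's output is only a guess, but hand-made make dummies sitting alongside Model and Trim factors would fit the pattern.

```r
# Hypothetical data: the color column from the question plus invented prices.
cars <- data.frame(
  price = c(21, 18, 25, 19, 23, 20, 22, 17, 26, 18, 24, 21),
  color = rep(c("Red", "Green", "Yellow", "Blue", "Pink", "Grey"), times = 2)
)
cars$color <- factor(cars$color)

# model.matrix() builds the same kind of design matrix that lm() uses.
# With the default treatment contrasts, the first level (alphabetically,
# "Blue") is the baseline and gets no column of its own.
X <- model.matrix(~ color, data = cars)
print(colnames(X))

# relevel() changes which level is treated as the baseline.
cars$color_red <- relevel(cars$color, ref = "Red")
print(levels(cars$color_red))

# A hand-made dummy for "Blue" is an exact linear combination of the
# intercept and the other color dummies, so lm() cannot estimate it.
cars$is_blue <- as.numeric(cars$color == "Blue")
fit <- lm(price ~ color + is_blue, data = cars)
print(coef(fit))  # the is_blue coefficient is reported as NA
```

If the NA values come from this kind of overlap, dropping the hand-made dummies and keeping only the factors should leave the fitted values unchanged while removing the undefined coefficients.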