Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to create dummy variables?

Tags:

r

I have a variable that is a factor :

 $ year           : Factor w/ 8 levels "2003","2004",..: 4 6 4 2 4 1 3 3 7 2 ...

I would like to create 8 dummy variables, named "2003", "2004" etc that take the value 0 or 1 depending on the value that the variable "year" takes. The nearest I could come up with is

dt1 <- cbind (dt1, model.matrix(~dt1$year - 1) )

But this has the unfortunate consequences of

  1. The dummy variables are named dt1$year2003, not just "2003", "2004" etc
  2. It seems that NA rows are omitted altogether by model.matrix (so the above command fails due to different lengths when NA is present in the year variable).

Of course I can get around these problems with more code, but I like my code to be as concise as possible (within reason) so if anyone can suggest better ways to make the dummy variables I would be obliged.

like image 200
Joe King Avatar asked Oct 06 '12 08:10

Joe King


People also ask

What does it mean to create dummy variables?

A dummy variable is a numerical variable used in regression analysis to represent subgroups of the sample in your study. In research design, a dummy variable is often used to distinguish different treatment groups.

Do dummy variables have to be 0 and 1?

Dummy variables are independent variables which take the value of either 0 or 1. Just as a "dummy" is a stand-in for a real person, in quantitative analysis, a dummy variable is a numeric stand-in for a qualitative fact or a logical proposition.


2 Answers

This is as concise as I could get. The na.action option takes care of the NA values (I would rather do this with an argument than with a global options setting, but I can't see how). The naming of columns is pretty deeply hard-coded, don't see any way to override it within model.matrix ...

options(na.action=na.pass)
dt1 <- data.frame(year=factor(c(NA,2003:2005)))
dt2 <- setNames(cbind(dt1,model.matrix(~year-1,data=dt1)),
              c("year",levels(dt1$year)))

As pointed out above, you may run into trouble in some contexts with column names that are not legal R variable names.

  year 2003 2004 2005
1 <NA>   NA   NA   NA
2 2003    1    0    0
3 2004    0    1    0
4 2005    0    0    1
like image 51
Ben Bolker Avatar answered Oct 03 '22 06:10

Ben Bolker


You could use ifelse() which won't omit na rows (but I guess you might not count it as being "as concise as possible"):

dt1 <- data.frame(year=factor(rep(2003:2010, 10)))  # example data

dt1 <- within(dt1, yr2003<-ifelse(year=="2003", 1, 0))
dt1 <- within(dt1, yr2004<-ifelse(year=="2004", 1, 0))
dt1 <- within(dt1, yr2005<-ifelse(year=="2005", 1, 0))
# ...    

head(dt1)
#   year yr2003 yr2004 yr2005
# 1 2003      1      0      0
# 2 2004      0      1      0
# 3 2005      0      0      1
# 4 2006      0      0      0
# 5 2007      0      0      0
# 6 2008      0      0      0
like image 24
smillig Avatar answered Oct 03 '22 08:10

smillig