 

Create new dummy variable columns from categorical variable

Tags:

r

I have several data sets with 75,000 observations and a type variable that takes a value from 0 to 4. I want to add five new dummy variables to each data set, one for each type. The best way I could come up with to do this is as follows:

    # For the 'binom' data set create dummy variables for all types in all data sets
    binom.dummy.list<-list()
    for(i in 0:4){
        binom.dummy.list[[i+1]]<-sapply(binom$type,function(t) ifelse(t==i,1,0))
    }

    # Add and merge data
    binom.dummy.df<-as.data.frame(do.call("cbind",binom.dummy.list))
    binom.dummy.df<-transform(binom.dummy.df,id=1:nrow(binom))
    binom<-merge(binom,binom.dummy.df,by="id")

While this works, it is incredibly slow (the merge function has even crashed a few times). Is there a more efficient way to do this? Perhaps this functionality is part of a package that I am not familiar with?

DrewConway asked Aug 02 '10 01:08



1 Answer

R has a "sub-language" for translating formulas into a design matrix, and in the spirit of the language you can take advantage of it. It's fast and concise. Example: you have a numeric predictor x, a categorical predictor catVar (stored as a factor), and a response y.

    > binom <- data.frame(y=runif(1e5), x=runif(1e5), catVar=as.factor(sample(0:4,1e5,TRUE)))
    > head(binom)
              y          x catVar
    1 0.5051653 0.34888390      2
    2 0.4868774 0.85005067      2
    3 0.3324482 0.58467798      2
    4 0.2966733 0.05510749      3
    5 0.5695851 0.96237936      1
    6 0.8358417 0.06367418      2

You just do

    > A <- model.matrix(y ~ x + catVar,binom)
    > head(A)
      (Intercept)          x catVar1 catVar2 catVar3 catVar4
    1           1 0.34888390       0       1       0       0
    2           1 0.85005067       0       1       0       0
    3           1 0.58467798       0       1       0       0
    4           1 0.05510749       0       0       1       0
    5           1 0.96237936       1       0       0       0
    6           1 0.06367418       0       1       0       0

Done.
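Note that the matrix above has only four catVar columns: with R's default treatment contrasts, level 0 is the reference level and is absorbed into the intercept. If you want all five indicator columns, as the question asks, one option is to drop the intercept from the formula. The following is a minimal sketch, not a tested drop-in for the asker's data: the small `binom` data frame, its `id` column, and the `type0`..`type4` column names are made up here for illustration.

```r
# Hypothetical stand-in for the asker's data: 'type' takes values 0-4.
set.seed(1)
binom <- data.frame(id = 1:1000, type = sample(0:4, 1000, replace = TRUE))

# Dropping the intercept (the "- 1") makes model.matrix() emit one indicator
# column per factor level instead of omitting the reference level.
# Fixing levels = 0:4 guarantees five columns even if a type is absent.
dummies <- model.matrix(~ factor(type, levels = 0:4) - 1, data = binom)
colnames(dummies) <- paste0("type", 0:4)

# Rows come back in the original order, so cbind() is enough; no merge() needed.
binom <- cbind(binom, dummies)
head(binom)
```

Because the rows stay in their original order, this avoids both the element-by-element sapply() loop and the merge() step from the question.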

gappy answered Oct 14 '22 22:10