Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating indicator variable columns in dplyr chain

Tags:

r

dplyr

tidyr

Updated: With apologies to those who replied, in my original example I overlooked the fact that data.frame() created var as a factor rather than as a character vector, as I had intended. I have corrected the example, and this will break at least one of the answers.

--original--

I have a data frame that I'm performing a series of dplyr and tidyr manipulations on, and I would like to add columns for indicator variables that would be encoded as 0 or 1, and do this within the dplyr chain. Each level of a factor (presently stored as character vectors) should be encoded in a separate column, and the column names are a concatenation of a fixed prefix with the variable level, e.g. var has level a, new column var_a will be 1, and all other rows of var_a will be 0.

The following minimal example using base R produces exactly the results that I want (thanks to this blog post), but I'd like to roll it all into the dplyr chain, and can't quite figure out how to do it.

library(dplyr)
df <- data.frame(var = sample(x = letters[1:4], size = 10, replace = TRUE), stringsAsFactors = FALSE)
for(level in unique(df$var)){
  df[paste("var", level, sep = "_")] <- ifelse(df$var == level, 1, 0)
}

Note that the real data set contains multiple columns, none of which should be altered or dropped when creating the indicator variables, with the exception of the column var, which could be converted to type factor.

like image 510
Tom Avatar asked Mar 11 '16 15:03

Tom


1 Answers

It's not pretty, but this function should work

dummy <- function(data, col) {
    for(c in col) {
        idx <- which(names(data)==c)
        v <- data[[idx]]
        stopifnot(class(v)=="factor")
        m <- matrix(0, nrow=nrow(data), ncol=nlevels(v))
        m[cbind(seq_along(v), as.integer(v))]<-1
        colnames(m) <- paste(c, levels(v), sep="_")
        r <- data.frame(m)
        if ( idx>1 ) {
            r <- cbind(data[1:(idx-1)],r)
        }
        if ( idx<ncol(data) ) {
            r <- cbind(r, data[(idx+1):ncol(data)])
        }
        data <- r
    }
    data
}

Here's a sample data.frame

dd <- data.frame(a=runif(30),
    b=sample(letters[1:3],30,replace=T),
    c=rnorm(30),
    d=sample(letters[10:13],30,replace=T)
)

and you specify the columns you want to expand as a character vector. You can do

dd %>% dummy("b")

or

dd %>% dummy(c("b","d"))
like image 119
MrFlick Avatar answered Nov 01 '22 19:11

MrFlick