Efficient Collapse Dummy Variables

Tags:

r

What is an efficient way (any solution, including non-base packages, is welcome) to collapse dummy variables back into a factor?

   race.White race.Hispanic race.Black race.Asian
1           1             0          0          0
2           0             0          0          1
3           1             0          0          0
4           0             0          1          0
5           0             0          0          1
6           0             1          0          0
7           1             0          0          0
8           1             0          0          0
9           1             0          0          0
10          0             0          1          0

Desired output:

       race
1     White
2     Asian
3     White
4     Black
5     Asian
6  Hispanic
7     White
8     White
9     White
10    Black

Data:

dat <- structure(list(race.White = c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 
1L, 0L), race.Hispanic = c(0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 
0L), race.Black = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L), 
    race.Asian = c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("race.White", 
"race.Hispanic", "race.Black", "race.Asian"), row.names = c(NA, 
-10L), class = "data.frame")

What I tried:

This is a possible solution, but I am sure there's a better indexing/dplyr/data.table/etc. solution.

apply(dat, 1, function(x) sub("[^.]+\\.", "", colnames(dat))[x == 1])
Tyler Rinker asked Sep 16 '15 02:09


2 Answers

We can use max.col to get the column index, subset the column names based on that and use sub to remove the prefix.

sub('[^.]+\\.', '', names(dat)[max.col(dat)])
#[1] "White"    "Asian"    "White"    "Black"    "Asian"    "Hispanic"
#[7] "White"    "White"    "White"    "Black"  

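Here, I assumed that each row contains exactly one 1. If a row can contain multiple 1s, note that max.col breaks ties at random by default; use ties.method='first' or ties.method='last' to pick the first or last matching column deterministically.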
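As a hedged illustration of the tie handling, the sketch below uses a small made-up data frame (`toy`, not the question's `dat`) whose third row has two 1s. It also wraps the result in `factor()`, since the question asks for a factor. The names `toy`, `lev`, `idx_first`, `idx_last`, and `ambiguous` are assumptions for this example only.

```r
# Toy data (hypothetical): row 3 is ambiguous because it has two 1s
toy <- data.frame(race.White    = c(1L, 0L, 1L),
                  race.Hispanic = c(0L, 0L, 1L),
                  race.Black    = c(0L, 1L, 0L))

# Deterministic tie-breaking instead of the default "random"
idx_first <- max.col(toy, ties.method = "first")
idx_last  <- max.col(toy, ties.method = "last")

# Strip the "race." prefix once and reuse it as the factor levels
lev  <- sub("[^.]+\\.", "", names(toy))
race <- factor(lev[idx_first], levels = lev)

# Rows that do not contain exactly one 1 can be flagged beforehand
ambiguous <- rowSums(toy) != 1
```

Passing `levels = lev` keeps every category as a level (in column order) even when it does not occur in the data.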


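Another option is to matrix-multiply the data by the sequence of column indices with %*%. With a single 1 per row, each product equals the index of the column holding that 1, so we can use it to subset the column names and then remove the prefix with sub.

sub('[^.]+\\.', '', names(dat)[(as.matrix(dat) %*% seq_along(dat))[, 1]])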
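To make the matrix-product trick concrete, here is a minimal sketch on a made-up 3 x 3 indicator matrix (`m` and the other names are assumptions, not part of the original answer). It also shows why the one-1-per-row assumption matters for this variant.

```r
# Toy indicator matrix (hypothetical), one 1 per row
m <- rbind(c(1L, 0L, 0L),
           c(0L, 0L, 1L),
           c(0L, 1L, 0L))
colnames(m) <- c("race.White", "race.Hispanic", "race.Black")

# Each row's dot product with 1:ncol(m) is the position of its single 1
idx <- drop(m %*% seq_len(ncol(m)))

# Guard: the trick is only valid when every row sums to exactly 1.
# An all-zero row would give index 0 (silently dropped when subsetting),
# and a row with several 1s would give the sum of their positions.
valid <- all(rowSums(m) == 1)

labels <- sub("[^.]+\\.", "", colnames(m)[idx])
```

Checking `valid` before indexing is cheap and avoids a result that is silently shorter than the number of rows.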

Or we can use pmax:

sub('[^.]+\\.', '', names(dat)[do.call(pmax, dat * seq_along(dat)[col(dat)])])
akrun answered Sep 21 '22 15:09


Another idea:

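ff = function(x)
{
    # ans[r] will hold the index of the column that contains the 1 in row r
    ans = integer(nrow(x))
    # loop over columns; rows where column i is non-zero get index i
    for(i in seq_along(x)) ans[as.logical(x[[i]])] = i
    names(x)[ans]
}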
sub("[^.]+\\.", "", ff(dat))
#[1] "White"    "Asian"    "White"    "Black"    "Asian"    "Hispanic" "White"    "White"    "White"    "Black"

And to compare with akrun's alternatives:

akrun1 = function(x) names(x)[max.col(x, "first")]
akrun2 = function(x) names(x)[(as.matrix(x) %*% seq_along(x))[, 1]]
akrun3 = function(x) names(x)[do.call(pmax, x * seq_along(x)[col(x)])]
akrunlike = function(x) names(x)[do.call(pmax, Map("*", x, seq_along(x)))]

DF = setNames(as.data.frame("[<-"(matrix(0L, 1e4, 1e3), 
                                  cbind(seq_len(1e4), sample(1e3, 1e4, TRUE)), 
                                  1L)), 
              paste("fac", 1:1e3, sep = ""))

identical(ff(DF), akrun1(DF))
#[1] TRUE
identical(ff(DF), akrun2(DF))
#[1] TRUE
identical(ff(DF), akrun3(DF))
#[1] TRUE
identical(ff(DF), akrunlike(DF))
#[1] TRUE
microbenchmark::microbenchmark(ff(DF), akrun1(DF), akrun2(DF), 
                               akrun3(DF), akrunlike(DF), 
                               as.matrix(DF), col(DF), times = 30)
#Unit: milliseconds
#          expr        min         lq     median         uq        max neval
#        ff(DF)   61.99124   64.56194   78.62267  102.18424  152.64891    30
#    akrun1(DF)  296.89042  314.28641  327.95059  353.46185  394.46013    30
#    akrun2(DF)  103.76105  114.01497  120.12191  129.86513  166.13266    30
#    akrun3(DF) 1141.46478 1163.96842 1178.92961 1203.83848 1231.70346    30
# akrunlike(DF)  125.47542  130.20826  141.66123  157.92743  203.42331    30
# as.matrix(DF)   19.46940   20.54543   28.22377   35.69575   87.06001    30
#       col(DF)  103.61454  112.75450  116.00120  126.09138  176.97435    30

I included as.matrix() and col() in the benchmark only to show how much time the structural conversions take on their own. A data.frame is a "list"-y structure (a list of columns), so it can be looped over efficiently as is: in contrast to by-row looping, by-column looping spends no time transforming the structure of the data.
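The by-column versus by-row contrast can be sketched as follows, on a tiny made-up data frame (all names here are hypothetical). The by-column loop touches each column directly with `[[`, whereas `apply()` over rows first converts the whole data.frame with `as.matrix()`.

```r
# Toy data (hypothetical names), one 1 per row
toy <- data.frame(a.x = c(1L, 0L, 0L),
                  a.y = c(0L, 1L, 0L),
                  a.z = c(0L, 0L, 1L))

# A data.frame is a list whose elements are the columns
is_list <- is.list(toy)

# By column: toy[[j]] extracts a column without converting the data.frame
by_col <- integer(nrow(toy))
for (j in seq_along(toy)) by_col[toy[[j]] == 1L] <- j

# By row: apply() converts the data.frame to a matrix before looping
by_row <- as.integer(apply(toy, 1, function(r) which(r == 1L)))
```

Both give the same column indices; the difference is only in the conversion work done before the loop, which is what the as.matrix() timing in the benchmark above isolates.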

alexis_laz answered Sep 21 '22 15:09