Efficient Collapse Dummy Variables

Tags:

r

What is an efficient way (any solution, including non-base packages, is welcome) to collapse dummy variables back into a factor?

   race.White race.Hispanic race.Black race.Asian
1           1             0          0          0
2           0             0          0          1
3           1             0          0          0
4           0             0          1          0
5           0             0          0          1
6           0             1          0          0
7           1             0          0          0
8           1             0          0          0
9           1             0          0          0
10          0             0          1          0

Desired output:

       race
1     White
2     Asian
3     White
4     Black
5     Asian
6  Hispanic
7     White
8     White
9     White
10    Black

Data:

dat <- structure(list(race.White = c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 
1L, 0L), race.Hispanic = c(0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 
0L), race.Black = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L), 
    race.Asian = c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("race.White", 
"race.Hispanic", "race.Black", "race.Asian"), row.names = c(NA, 
-10L), class = "data.frame")

What I tried:

This is a possible solution, but I am sure there's a better indexing/dplyr/data.table/etc. solution.

# compare to 1 so the 0/1 values act as a logical mask rather than as positions
apply(dat, 1, function(x) sub("[^.]+\\.", "", colnames(dat))[x == 1])

asked Sep 16 '15 02:09

Tyler Rinker

2 Answers

We can use max.col to get the index of the column that holds the 1 in each row, subset the column names with that index, and use sub to remove the prefix.

sub('[^.]+\\.', '', names(dat)[max.col(dat)])
#[1] "White"    "Asian"    "White"    "Black"    "Asian"    "Hispanic"
#[7] "White"    "White"    "White"    "Black"

Here, I assumed that each row contains exactly one 1. If a row can contain several 1s, max.col breaks the tie at random by default, so pass ties.method='first' or ties.method='last' to make the choice deterministic.
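To see what the ties.method argument changes, here is a small sketch on a made-up data frame (the name m and its values are illustrative, not taken from the question) in which one row has two 1s and one row has none. The rowSums check is just one possible way to spot such rows before collapsing.

```r
# Hypothetical data: row 2 has two 1s, row 3 has no 1 at all
m <- data.frame(race.White = c(1L, 1L, 0L),
                race.Black = c(0L, 1L, 0L),
                race.Asian = c(0L, 0L, 0L))

# Deterministic tie-breaking: leftmost or rightmost maximum in each row
first <- names(m)[max.col(m, ties.method = "first")]
last  <- names(m)[max.col(m, ties.method = "last")]

# Rows that violate the "exactly one 1" assumption
ambiguous <- rowSums(m) != 1
```

Note that the all-zero row still gets a column name, because every column ties for the maximum there, so rows like that may be worth flagging separately.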

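Another option is to matrix-multiply (%*%) the data by the sequence of column indices. With a single 1 per row, the product is the index of the column holding it; we then subset the column names and remove the prefix with sub.

sub('[^.]+\\.', '', names(dat)[(as.matrix(dat) %*% seq_along(dat))[, 1]])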
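Why the matrix product yields a column index may be easier to see on a tiny case. The sketch below uses a made-up three-row data frame (d is an illustrative name) with exactly one 1 per row.

```r
# Hypothetical three-row example with exactly one 1 per row
d <- data.frame(race.White    = c(1L, 0L, 0L),
                race.Hispanic = c(0L, 0L, 1L),
                race.Black    = c(0L, 0L, 0L),
                race.Asian    = c(0L, 1L, 0L))

# Each row is multiplied elementwise by 1:4 and summed, so a row
# like (0, 0, 0, 1) gives 0*1 + 0*2 + 0*3 + 1*4 = 4
idx <- (as.matrix(d) %*% seq_along(d))[, 1]

collapsed <- sub("[^.]+\\.", "", names(d)[idx])
```

A row with two 1s would add both indices together and point at the wrong column (or past the last one), so this approach leans on the single-1 assumption more heavily than max.col does.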

Or we can use pmax:

sub('[^.]+\\.', '', names(dat)[do.call(pmax, dat * seq_along(dat)[col(dat)])])
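The one-liners above return a character vector, while the question asks for a factor in a column called race. A minimal sketch of that last step, assuming one 1 per row and using the max.col variant, could look like this (lev and out are names chosen here for illustration):

```r
# Data from the question
dat <- data.frame(
  race.White    = c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L),
  race.Hispanic = c(0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L),
  race.Black    = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L),
  race.Asian    = c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L)
)

# Strip the "race." prefix once, on the column names only
lev <- sub("[^.]+\\.", "", names(dat))

# Index the cleaned names and keep the column order as the level order
out <- data.frame(race = factor(lev[max.col(dat, ties.method = "first")],
                                levels = lev))
```

Cleaning the four column names up front avoids running sub over every row of the result, and passing levels explicitly keeps a category in the factor even when no row happens to use it.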

answered Sep 21 '22 15:09

akrun

Another idea:

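ff = function(x)
{
    # one slot per row; filled with the index of the column that holds the 1
    ans = integer(nrow(x))
    # loop over columns (a data frame is a list of columns), not over rows
    for(i in seq_along(x)) ans[as.logical(x[[i]])] = i
    names(x)[ans]
}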
sub("[^.]+\\.", "", ff(dat))
#[1] "White"    "Asian"    "White"    "Black"    "Asian"    "Hispanic" "White"    "White"    "White"    "Black"
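ff() as written expects at least one 1 in every row: a row of zeros leaves a 0 in ans, and indexing with 0 silently drops that element, so the result comes out shorter than the input. One possible variant (ff_na is a name made up here, not part of the answer) starts from NA instead, so that such rows come back as NA.

```r
# Variant of ff(); the name ff_na and the NA convention are assumptions
ff_na = function(x)
{
    ans = rep(NA_integer_, nrow(x))   # NA marks "no 1 found in this row"
    for(i in seq_along(x)) ans[as.logical(x[[i]])] = i
    names(x)[ans]
}

# Hypothetical data: the second row has no 1
d = data.frame(race.White = c(1L, 0L, 0L),
               race.Black = c(0L, 0L, 1L))
res = sub("[^.]+\\.", "", ff_na(d))
```

As in ff(), a row with several 1s ends up with the last matching column, since later iterations overwrite earlier ones.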

And to compare with akrun's alternatives:

akrun1 = function(x) names(x)[max.col(x, "first")]
akrun2 = function(x) names(x)[(as.matrix(x) %*% seq_along(x))[, 1]]
akrun3 = function(x) names(x)[do.call(pmax, x * seq_along(x)[col(x)])]
akrunlike = function(x) names(x)[do.call(pmax, Map("*", x, seq_along(x)))]

DF = setNames(as.data.frame("[<-"(matrix(0L, 1e4, 1e3), 
                                  cbind(seq_len(1e4), sample(1e3, 1e4, TRUE)), 
                                  1L)), 
              paste("fac", 1:1e3, sep = ""))

identical(ff(DF), akrun1(DF))
#[1] TRUE
identical(ff(DF), akrun2(DF))
#[1] TRUE
identical(ff(DF), akrun3(DF))
#[1] TRUE
identical(ff(DF), akrunlike(DF))
#[1] TRUE
microbenchmark::microbenchmark(ff(DF), akrun1(DF), akrun2(DF), 
                               akrun3(DF), akrunlike(DF), 
                               as.matrix(DF), col(DF), times = 30)
#Unit: milliseconds
#          expr        min         lq     median         uq        max neval
#        ff(DF)   61.99124   64.56194   78.62267  102.18424  152.64891    30
#    akrun1(DF)  296.89042  314.28641  327.95059  353.46185  394.46013    30
#    akrun2(DF)  103.76105  114.01497  120.12191  129.86513  166.13266    30
#    akrun3(DF) 1141.46478 1163.96842 1178.92961 1203.83848 1231.70346    30
# akrunlike(DF)  125.47542  130.20826  141.66123  157.92743  203.42331    30
# as.matrix(DF)   19.46940   20.54543   28.22377   35.69575   87.06001    30
#       col(DF)  103.61454  112.75450  116.00120  126.09138  176.97435    30

I included as.matrix() and col() in the benchmark just to show that list-like structures such as data frames can be looped over efficiently as they are. In contrast to looping by row, looping by column does not need to spend time converting the data to another structure first.

answered Sep 21 '22 15:09

alexis_laz
